语法函数2-数据处理dplyr包#

读取数据#

  • 在学习R语言过程中,除了使用内置的数据集外,我们更多的需要导入外部数据

数据科学中的文件管理#

  • 把项目所需的文件(代码、数据、图片等),放在一个文件夹里

  • 读取文件

    • 事实上,R语言提供了很多读取数据的函数。下表列出了常见文件格式的读取方法

文件格式

R函数

txt

read.table()

.csv

read.csv() and readr::read_csv()

.xls and .xlsx

readxl::read_excel() and openxlsx::read.xlsx()

.sav(SPSS files)

haven::read_sav() and foreign::read.spss()

.Rdata or rda

load()

.rds

readRDS() and readr::read_rds()

.dta

haven::read_dta() and haven::read_stata()

.sas7bdat(SAS files)

haven::read_sas()

Internet

download.file()

here宏包#

  • 强大的here宏包,here()会告诉我们当前所在的目录

  • 指向某个文件的路径信息

  • here宏包的好处还在于,在不同的电脑和文件结构下,代码都能运行

here::here()
'/public/home/sll/mybook/content'
here::here("software", "smoove")
d <- read.table(file= "./data/txt_file.txt", header = TRUE)
'/public/home/sll/mybook/content/software/smoove'
Warning message in file(file, "rt"):
“无法打开文件'./data/txt_file.txt': 没有那个文件或目录”
Error in file(file, "rt"): 无法打开链结
Traceback:

1. read.table(file = "./data/txt_file.txt", header = TRUE)
2. file(file, "rt")

数据处理#

  • 数据处理的工具dplyr宏包–tidyverse里的“瑞士军刀”

  • dplyr定义了数据处理的规范语法,其中主要包含以下10个主要的函数

    • mutate(), select(), rename() , filter()

    • summarise(), group_by(), arrange()

    • left_join(), right_join()full_join()

  • if_else() case_when() everything()函数

  • count()

    • 对给定条件的某列进行计数,count(x, condition)相当于filter(x,condition) %>%  group_by(x) %>%  summarise(n=n())

  • across()

    • 对给定的某些列赋予函数计算

    • across(.cols = everything(), .fns = NULL, ..., .names = NULL)

      • 用在 mutate()summarise() 函数里面

      • across()多列执行相同的函数操作,返回数据框

      • list()列出多个要使用的函数及对应列名

library(dplyr)
载入程辑包:‘dplyr’
The following objects are masked from ‘package:stats’:

    filter, lag
The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
df <- data.frame(
name=c("Alice", "Alice", "Bob", "Bob", "Carol", "Carol"),
type=c("english", "math", "english", "math", "english", "math"),
score=c(80, 60, 70, 69, 80, 90))
df
A data.frame: 6 × 3
nametypescore
<chr><chr><dbl>
Aliceenglish80
Alicemath 60
Bob english70
Bob math 69
Carolenglish80
Carolmath 90

1 新增一列 mutata()#

  • mutate()函数的功能是给数据框新增一列,使用语法为 mutate(.data = df, name = value)

    • 第一个参数.data,接受要处理的数据框,比如这里的df

    • 第二个参数是Name-value对, 比如extra = reward

      • 等号左边,是我们为新增的一列取的名字,比如这里的extra,因为数据框每一列都是要有名字的;

      • 等号右边,是打算并入数据框的向量,比如这里的reward,它是装着学生成绩的向量。注意,向量的长度,

        • 要么与数据框的行数等长,比如这里向量长度为6;

        • 要么长度为1,即,新增的这一列所有的值都是一样的(循环补齐机制)。

image.png

  • 因为mutate()函数处理的是数据框,并且固定放置在第一位置上(几乎所有dplyr的函数都是这样要求的),所以这个.data可以偷懒不写,直接写mutate(df, extra = reward)。另外,如果想同时新增多个列,只需要提供多个Name-value对即可

reward <- c(2, 5, 9, 8, 5, 6)
mutate(df, extra=reward)
A data.frame: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish802
Alicemath 605
Bob english709
Bob math 698
Carolenglish805
Carolmath 906
# 新增多列
mutate(df,
      extra1 = c(2, 5, 9, 8, 5, 6),
      extra2 = c(1, 2, 3, 3, 2, 1),
      extra3 = c(8)
      )
A data.frame: 6 × 6
nametypescoreextra1extra2extra3
<chr><chr><dbl><dbl><dbl><dbl>
Aliceenglish80218
Alicemath 60528
Bob english70938
Bob math 69838
Carolenglish80528
Carolmath 90618

2 管道 %>%#

  • Windows系统中可以通过Ctrl + Shift + M 快捷键产生 %>%

c(1:10) %>% sum()
55
  • 这条语句的意思是f(x) 写成 x %>% f(),这里向量 c(1:10) 通过管道操作符 %>% ,传递到函数sum()的第一个参数位置,即sum(c(1:10))

  • image.png

df %>% mutate(extra=reward)
A data.frame: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish802
Alicemath 605
Bob english709
Bob math 698
Carolenglish805
Carolmath 906

3 向量函数与mutate()#

  • mutate()函数的本质还是向量函数和向量化操作,只不过是换作在数据框中完成,这样更能形成“数据框进、数据框出”的思维,方便快捷地构思并统计任务

  • mutate()中引用数据框的某一列名,实际上是引用了列名对应的整个向量, 所以,这里我们传递scorecalc_square(),就是把整个score向量传递给calc_square().

    • image.png

  • 几何算符(这里是平方)是向量化的,因此calc_square()会对输入的score向量,返回一个等长的向量。

    • image-2.png

  • mutate()拿到这个新的向量后,就在原有数据框中添加新的一列new_col

    • image-3.png

calc_square <- function(x){
    x^2
}
df %>% mutate(new_col = calc_square(score))
A data.frame: 6 × 4
nametypescorenew_col
<chr><chr><dbl><dbl>
Aliceenglish806400
Alicemath 603600
Bob english704900
Bob math 694761
Carolenglish806400
Carolmath 908100

4 保存为新的数据框#

df_new <- df %>% 
  mutate(extra = reward) %>% 
  mutate(total = score + extra)
df_new

df_new2 <- df %>% 
  mutate(extra = reward,
        total = score + extra)
df_new2
A data.frame: 6 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Aliceenglish80282
Alicemath 60565
Bob english70979
Bob math 69877
Carolenglish80585
Carolmath 90696
A data.frame: 6 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Aliceenglish80282
Alicemath 60565
Bob english70979
Bob math 69877
Carolenglish80585
Carolmath 90696

5 选取列 select()#

  • select()顾名思义选择,就是选择数据框的某一列,或者某几列。

  • 数据框进数据框出是dplyr函数的第二个特点

  • 删除某列,可以在变量前面加 - 或者 !,两者的结果是一样的。

  • 也可以通过位置索引进行选取

  • 如果要选取数据框的列很多,我们也可以先观察列名的特征,用特定的函数进行选取

df_new %>% select(name, extra)
A data.frame: 6 × 2
nameextra
<chr><dbl>
Alice2
Alice5
Bob 9
Bob 8
Carol5
Carol6
df_new %>% select(-name)
A data.frame: 6 × 4
typescoreextratotal
<chr><dbl><dbl><dbl>
english80282
math 60565
english70979
math 69877
english80585
math 90696
df_new %>% select(1,2,3)
df_new %>% select(1:3)
A data.frame: 6 × 3
nametypescore
<chr><chr><dbl>
Aliceenglish80
Alicemath 60
Bob english70
Bob math 69
Carolenglish80
Carolmath 90
A data.frame: 6 × 3
nametypescore
<chr><chr><dbl>
Aliceenglish80
Alicemath 60
Bob english70
Bob math 69
Carolenglish80
Carolmath 90
# 如果要选取数据框的列很多,我们也可以先观察列名的特征,用特定的函数进行选取
df_new %>% select(starts_with("s"))
df_new %>% select(ends_with("e"))
df_new %>% select(contains("score"))  # 列名包含
df_new %>% select(where(is.character)) # 通过变量的类型来选取
df_new %>% select(where(is.numeric))
df_new %>% select(where(is.numeric) & starts_with("t"))
df_new %>% select(!starts_with("s"))
A data.frame: 6 × 1
score
<dbl>
80
60
70
69
80
90
A data.frame: 6 × 3
nametypescore
<chr><chr><dbl>
Aliceenglish80
Alicemath 60
Bob english70
Bob math 69
Carolenglish80
Carolmath 90
A data.frame: 6 × 1
score
<dbl>
80
60
70
69
80
90
A data.frame: 6 × 2
nametype
<chr><chr>
Aliceenglish
Alicemath
Bob english
Bob math
Carolenglish
Carolmath
A data.frame: 6 × 3
scoreextratotal
<dbl><dbl><dbl>
80282
60565
70979
69877
80585
90696
A data.frame: 6 × 1
total
<dbl>
82
65
79
77
85
96
A data.frame: 6 × 4
nametypeextratotal
<chr><chr><dbl><dbl>
Aliceenglish282
Alicemath 565
Bob english979
Bob math 877
Carolenglish585
Carolmath 696

6 修改列名 rename()#

  • rename()修改列的名字, 具体方法是rename(.data, new_name = old_name),和mutate()一样,等号左边是新的变量名,右边是已经存在的变量名(这是dplyr函数的第三个特征)

df_new %>% 
  select(name, type, total) %>% 
  rename(total_score = total)
A data.frame: 6 × 3
nametypetotal_score
<chr><chr><dbl>
Aliceenglish82
Alicemath 65
Bob english79
Bob math 77
Carolenglish85
Carolmath 96

7 筛选filter()#

  • 前面select()列方向的选择,而用filter()函数,我们可以对数据框行方向进行筛选,选出符合特定条件的某些行

  • image.png

  • 注意,这里filter()函数不是字面上“过滤掉”的意思,而是保留符合条件的行,也就说keep,不是drop的意思。

  • R提供了其他比较关系的算符: <, >, <=, >=, == (equal), != (not equal), %in%, is.na()!is.na() .

  • 可以限定多个条件进行筛选,支持逻辑运算符

  • 可以配合一些函数使用

filter()中的逻辑运算符

Operator

Meaning

==

Equal to

>

Greater than

<

Less than

>=

Greater than or equal to

<=

Less than or equal to

!=

Not equal to

%in%

in

is.na

is a missing value (NA)

!is.na

is not a missing value

&

and

|

or

df_new %>% filter(score > 70)
A data.frame: 3 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Aliceenglish80282
Carolenglish80585
Carolmath 90696
df_new %>% filter(type=="english" & score >= 75)
df_new %>% filter(type=="english", score >= 75)
A data.frame: 2 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Aliceenglish80282
Carolenglish80585
A data.frame: 2 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Aliceenglish80282
Carolenglish80585
df_new %>% filter(score == max(score))
A data.frame: 1 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Carolmath90696

8 统计汇总summarise()#

  • summarise()函数非常强大,主要用于统计汇总,往往与其他函数配合使用

df_new %>% summarise(mean_score = mean(score))
df_new %>% summarise(sd_score = sd(score))
A data.frame: 1 × 1
mean_score
<dbl>
74.83333
A data.frame: 1 × 1
sd_score
<dbl>
10.59088
df_new %>% summarise(
  mean_score = mean(score),
  median_score = median(score),
  n = n(),
  sum = sum(score))
A data.frame: 1 × 4
mean_scoremedian_scorensum
<dbl><dbl><int><dbl>
74.83333756449

9 分组统计 group_by()#

  • 实际运用中,summarise()函数往往配合group_by()一起使用,即,先分组再统计

df_new %>% 
  group_by(name) %>% 
  summarise(
    mean_score = mean(total),
    sd_score = sd(total))
A tibble: 3 × 3
namemean_scoresd_score
<chr><dbl><dbl>
Alice73.512.020815
Bob 78.0 1.414214
Carol90.5 7.778175

10 排序 arrange()#

  • arrange()就是按照某个变量进行排序,默认为从小到大排序

  • 在要排序的变量前加上-可改成从大到小排序

  • 使用desc()函数也可实现从大到小排序

  • 也可对多个变量依次排序

df_new %>% arrange(total)
df_new %>% arrange(-total)
df_new %>% arrange(desc(total))
A data.frame: 6 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Alicemath 60565
Bob math 69877
Bob english70979
Aliceenglish80282
Carolenglish80585
Carolmath 90696
A data.frame: 6 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Carolmath 90696
Carolenglish80585
Aliceenglish80282
Bob english70979
Bob math 69877
Alicemath 60565
A data.frame: 6 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Carolmath 90696
Carolenglish80585
Aliceenglish80282
Bob english70979
Bob math 69877
Alicemath 60565
df_new %>% 
  arrange(type, desc(total))
A data.frame: 6 × 5
nametypescoreextratotal
<chr><chr><dbl><dbl><dbl>
Carolenglish80585
Aliceenglish80282
Bob english70979
Carolmath 90696
Bob math 69877
Alicemath 60565

11 左联结 left_join()#

  • 实际操作中,会遇到数据框合并的情形

  • left_join(df1, df2, by = "name"),by指定通过哪一列进行合并,合并按照df1进行

df1 <- df_new %>% 
  group_by(name) %>% 
  summarise(
    mean_score = mean(total)
  )
df1

df2 <- tibble(
       name = c("Alice", "Bob", "Dave"),
       age =  c(12, 13, 14)
)
df2
# 使用 left_join()函数 把两个数据框df1和df2合并连接在一起
left_join(df1, df2, by = "name")
df1 %>% left_join(df2, by = "name")
A tibble: 3 × 2
namemean_score
<chr><dbl>
Alice73.5
Bob 78.0
Carol90.5
A tibble: 3 × 2
nameage
<chr><dbl>
Alice12
Bob 13
Dave 14
A tibble: 3 × 3
namemean_scoreage
<chr><dbl><dbl>
Alice73.512
Bob 78.013
Carol90.5NA
A tibble: 3 × 3
namemean_scoreage
<chr><dbl><dbl>
Alice73.512
Bob 78.013
Carol90.5NA

12 右联结 right_join()#

  • right_join(df1, df2, by = "name"),by指定通过哪一列进行合并,合并按照df2进行,没有对应信息计为NA

df1 %>% right_join(df2, by = "name")
A tibble: 3 × 3
namemean_scoreage
<chr><dbl><dbl>
Alice73.512
Bob 78.013
Dave NA14

13 满联结 full_join()#

  • 有时候,我们不想丢失项,可以使用full_join(),该函数确保条目是完整的,信息缺失的地方为NA

df1 %>% full_join(df2, by = "name")
A tibble: 4 × 3
namemean_scoreage
<chr><dbl><dbl>
Alice73.512
Bob 78.013
Carol90.5NA
Dave NA14

14 内联结inner_join()#

  • 只保留name条目相同地记录

df1 %>% inner_join(df2, by = "name")
A tibble: 2 × 3
namemean_scoreage
<chr><dbl><dbl>
Alice73.512
Bob 78.013

15 筛选联结 semi_join(x,y) anti_join(x,y)#

  • 筛选联结,有两个semi_join(x, y)anti_join(x, y)

  • 函数不改变数据框x的变量的数量,主要影响的是x的观测,也就说会剔除一些行,其功能类似filter()

  • 半联结semi_join(x, y),保留nameyname相一致的所有行,可以看作对x的筛选

  • 反联结anti_join(x, y),丢弃nameyname相一致的所有行

df1 %>% semi_join(df2, by="name")
df1 %>% filter(
  name %in% df2$name)
A tibble: 2 × 2
namemean_score
<chr><dbl>
Alice73.5
Bob 78.0
A tibble: 2 × 2
namemean_score
<chr><dbl>
Alice73.5
Bob 78.0
df1 %>% anti_join(df2, by="name")
df1 %>% filter(
  ! name %in% df2$name)
A tibble: 1 × 2
namemean_score
<chr><dbl>
Carol90.5
A tibble: 1 × 2
namemean_score
<chr><dbl>
Carol90.5
df %>% 
  group_by(name) %>% 
  summarise(mean_score = mean(score))

df %>% 
  group_by(name) %>% 
  mutate(mean_score = mean(score))
A tibble: 3 × 2
namemean_score
<chr><dbl>
Alice70.0
Bob 69.5
Carol85.0
A grouped_df: 6 × 4
nametypescoremean_score
<chr><chr><dbl><dbl>
Aliceenglish8070.0
Alicemath 6070.0
Bob english7069.5
Bob math 6969.5
Carolenglish8085.0
Carolmath 9085.0

dplyr进阶#

导入数据#

library(tidyverse)
library(palmerpenguins)
── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
 forcats   1.0.0      readr     2.1.5
 ggplot2   3.5.0      stringr   1.5.1
 lubridate 1.9.3      tibble    3.2.1
 purrr     1.0.2      tidyr     1.3.1
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
penguins <- penguins %>% drop_na() 
penguins %>% select(bill_length_mm, bill_depth_mm)
penguins %>% select(starts_with("bill_"))
A tibble: 333 × 2
bill_length_mmbill_depth_mm
<dbl><dbl>
39.118.7
39.517.4
40.318.0
36.719.3
39.320.6
38.917.8
39.219.6
41.117.6
38.621.2
34.621.1
36.617.8
38.719.0
42.520.7
34.418.4
46.021.5
37.818.3
37.718.7
35.919.2
38.218.1
38.817.2
35.318.9
40.618.6
40.517.9
37.918.6
40.518.9
39.516.7
37.218.1
39.517.8
40.918.9
36.417.0
46.916.6
53.519.9
49.019.5
46.217.5
50.919.1
45.517.0
50.917.9
50.818.5
50.117.9
49.019.6
51.518.7
49.817.3
48.116.4
51.419.0
45.717.3
50.719.7
42.517.3
52.218.8
45.216.6
49.319.9
50.218.8
45.619.4
51.919.5
46.816.5
45.717.0
55.819.8
43.518.1
49.618.2
50.819.0
50.218.7
A tibble: 333 × 2
bill_length_mmbill_depth_mm
<dbl><dbl>
39.118.7
39.517.4
40.318.0
36.719.3
39.320.6
38.917.8
39.219.6
41.117.6
38.621.2
34.621.1
36.617.8
38.719.0
42.520.7
34.418.4
46.021.5
37.818.3
37.718.7
35.919.2
38.218.1
38.817.2
35.318.9
40.618.6
40.517.9
37.918.6
40.518.9
39.516.7
37.218.1
39.517.8
40.918.9
36.417.0
46.916.6
53.519.9
49.019.5
46.217.5
50.919.1
45.517.0
50.917.9
50.818.5
50.117.9
49.019.6
51.518.7
49.817.3
48.116.4
51.419.0
45.717.3
50.719.7
42.517.3
52.218.8
45.216.6
49.319.9
50.218.8
45.619.4
51.919.5
46.816.5
45.717.0
55.819.8
43.518.1
49.618.2
50.819.0
50.218.7
penguins %>% select(where(is.numeric))
penguins %>% select(!where(is.character))
A tibble: 333 × 5
bill_length_mmbill_depth_mmflipper_length_mmbody_mass_gyear
<dbl><dbl><int><int><int>
39.118.718137502007
39.517.418638002007
40.318.019532502007
36.719.319334502007
39.320.619036502007
38.917.818136252007
39.219.619546752007
41.117.618232002007
38.621.219138002007
34.621.119844002007
36.617.818537002007
38.719.019534502007
42.520.719745002007
34.418.418433252007
46.021.519442002007
37.818.317434002007
37.718.718036002007
35.919.218938002007
38.218.118539502007
38.817.218038002007
35.318.918738002007
40.618.618335502007
40.517.918732002007
37.918.617231502007
40.518.918039502007
39.516.717832502007
37.218.117839002007
39.517.818833002007
40.918.918439002007
36.417.019533252007
46.916.619227002008
53.519.920545002008
49.019.521039502008
46.217.518736502008
50.919.119635502008
45.517.019635002008
50.917.919636752009
50.818.520144502009
50.117.919034002009
49.019.621243002009
51.518.718732502009
49.817.319836752009
48.116.419933252009
51.419.020139502009
45.717.319336002009
50.719.720340502009
42.517.318733502009
52.218.819734502009
45.216.619132502009
49.319.920340502009
50.218.820238002009
45.619.419435252009
51.919.520639502009
46.816.518936502009
45.717.019536502009
55.819.820740002009
43.518.120234002009
49.618.219337752009
50.819.021041002009
50.218.719837752009
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male 2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen38.621.21913800male 2007
AdelieTorgersen34.621.11984400male 2007
AdelieTorgersen36.617.81853700female2007
AdelieTorgersen38.719.01953450female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen34.418.41843325female2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 37.818.31743400female2007
AdelieBiscoe 37.718.71803600male 2007
AdelieBiscoe 35.919.21893800female2007
AdelieBiscoe 38.218.11853950male 2007
AdelieBiscoe 38.817.21803800male 2007
AdelieBiscoe 35.318.91873800female2007
AdelieBiscoe 40.618.61833550male 2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 37.918.61723150female2007
AdelieBiscoe 40.518.91803950male 2007
AdelieDream 39.516.71783250female2007
AdelieDream 37.218.11783900male 2007
AdelieDream 39.517.81883300female2007
AdelieDream 40.918.91843900male 2007
AdelieDream 36.417.01953325female2007
ChinstrapDream46.916.61922700female2008
ChinstrapDream53.519.92054500male 2008
ChinstrapDream49.019.52103950male 2008
ChinstrapDream46.217.51873650female2008
ChinstrapDream50.919.11963550male 2008
ChinstrapDream45.517.01963500female2008
ChinstrapDream50.917.91963675female2009
ChinstrapDream50.818.52014450male 2009
ChinstrapDream50.117.91903400female2009
ChinstrapDream49.019.62124300male 2009
ChinstrapDream51.518.71873250male 2009
ChinstrapDream49.817.31983675female2009
ChinstrapDream48.116.41993325female2009
ChinstrapDream51.419.02013950male 2009
ChinstrapDream45.717.31933600female2009
ChinstrapDream50.719.72034050male 2009
ChinstrapDream42.517.31873350female2009
ChinstrapDream52.218.81973450male 2009
ChinstrapDream45.216.61913250female2009
ChinstrapDream49.319.92034050male 2009
ChinstrapDream50.218.82023800male 2009
ChinstrapDream45.619.41943525female2009
ChinstrapDream51.919.52063950male 2009
ChinstrapDream46.816.51893650female2009
ChinstrapDream45.717.01953650female2009
ChinstrapDream55.819.82074000male 2009
ChinstrapDream43.518.12023400female2009
ChinstrapDream49.618.21933775male 2009
ChinstrapDream50.819.02104100male 2009
ChinstrapDream50.218.71983775female2009
# 多种组合选择
penguins %>% select(species, starts_with("bill_"))
A tibble: 333 × 3
speciesbill_length_mmbill_depth_mm
<fct><dbl><dbl>
Adelie39.118.7
Adelie39.517.4
Adelie40.318.0
Adelie36.719.3
Adelie39.320.6
Adelie38.917.8
Adelie39.219.6
Adelie41.117.6
Adelie38.621.2
Adelie34.621.1
Adelie36.617.8
Adelie38.719.0
Adelie42.520.7
Adelie34.418.4
Adelie46.021.5
Adelie37.818.3
Adelie37.718.7
Adelie35.919.2
Adelie38.218.1
Adelie38.817.2
Adelie35.318.9
Adelie40.618.6
Adelie40.517.9
Adelie37.918.6
Adelie40.518.9
Adelie39.516.7
Adelie37.218.1
Adelie39.517.8
Adelie40.918.9
Adelie36.417.0
Chinstrap46.916.6
Chinstrap53.519.9
Chinstrap49.019.5
Chinstrap46.217.5
Chinstrap50.919.1
Chinstrap45.517.0
Chinstrap50.917.9
Chinstrap50.818.5
Chinstrap50.117.9
Chinstrap49.019.6
Chinstrap51.518.7
Chinstrap49.817.3
Chinstrap48.116.4
Chinstrap51.419.0
Chinstrap45.717.3
Chinstrap50.719.7
Chinstrap42.517.3
Chinstrap52.218.8
Chinstrap45.216.6
Chinstrap49.319.9
Chinstrap50.218.8
Chinstrap45.619.4
Chinstrap51.919.5
Chinstrap46.816.5
Chinstrap45.717.0
Chinstrap55.819.8
Chinstrap43.518.1
Chinstrap49.618.2
Chinstrap50.819.0
Chinstrap50.218.7
## 返回向量
my_tibble[["x"]]
my_tibble$x
my_tibble %>% pull(x)
Error in eval(expr, envir, enclos): 找不到对象'my_tibble'
Traceback:
# 返回数据框
my_tibble["x"]
my_tibble %>% select(x)
Error in eval(expr, envir, enclos): 找不到对象'my_tibble'
Traceback:
tb <- tibble(
  x = 1:5,
  y = 0,
  z = 5:1,
  w = 0
)
tb
A tibble: 5 × 4
xyzw
<int><dbl><int><dbl>
1050
2040
3030
4020
5010
myfun <- function(x) sum(x) == 0

tb %>% select(where(myfun))

tb %>% select(where(~sum(.x) == 0))
A tibble: 5 × 2
yw
<dbl><dbl>
00
00
00
00
00
A tibble: 5 × 2
yw
<dbl><dbl>
00
00
00
00
00
df <- tibble(
  x = c(NA, NA, NA),
  y = c(2, 3, NA),
  z = c(NA, 5, NA)
)
df

df %>% select(where(~ !all(is.na(.x))))
df %>% filter(if_any(everything(), ~ !is.na(.x)))
A tibble: 3 × 3
xyz
<lgl><dbl><dbl>
NA 2NA
NA 3 5
NANANA
A tibble: 3 × 2
yz
<dbl><dbl>
2NA
3 5
NANA
A tibble: 2 × 3
xyz
<lgl><dbl><dbl>
NA2NA
NA3 5
penguins %>% filter(species %in% c("Adelie", "Gentoo"))
penguins %>% filter(species == "Adelie" & bill_length_mm > 40)
A tibble: 265 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male 2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen38.621.21913800male 2007
AdelieTorgersen34.621.11984400male 2007
AdelieTorgersen36.617.81853700female2007
AdelieTorgersen38.719.01953450female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen34.418.41843325female2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 37.818.31743400female2007
AdelieBiscoe 37.718.71803600male 2007
AdelieBiscoe 35.919.21893800female2007
AdelieBiscoe 38.218.11853950male 2007
AdelieBiscoe 38.817.21803800male 2007
AdelieBiscoe 35.318.91873800female2007
AdelieBiscoe 40.618.61833550male 2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 37.918.61723150female2007
AdelieBiscoe 40.518.91803950male 2007
AdelieDream 39.516.71783250female2007
AdelieDream 37.218.11783900male 2007
AdelieDream 39.517.81883300female2007
AdelieDream 40.918.91843900male 2007
AdelieDream 36.417.01953325female2007
GentooBiscoe52.217.12285400male 2009
GentooBiscoe45.514.52124750female2009
GentooBiscoe49.516.12245650male 2009
GentooBiscoe44.514.72144850female2009
GentooBiscoe50.815.72265200male 2009
GentooBiscoe49.415.82164925male 2009
GentooBiscoe46.914.62224875female2009
GentooBiscoe48.414.42034625female2009
GentooBiscoe51.116.52255250male 2009
GentooBiscoe48.515.02194850female2009
GentooBiscoe55.917.02285600male 2009
GentooBiscoe47.215.52154975female2009
GentooBiscoe49.115.02285500male 2009
GentooBiscoe46.816.12155500male 2009
GentooBiscoe41.714.72104700female2009
GentooBiscoe53.415.82195500male 2009
GentooBiscoe43.314.02084575female2009
GentooBiscoe48.115.12095500male 2009
GentooBiscoe50.515.22165000female2009
GentooBiscoe49.815.92295950male 2009
GentooBiscoe43.515.22134650female2009
GentooBiscoe51.516.32305500male 2009
GentooBiscoe46.214.12174375female2009
GentooBiscoe55.116.02305850male 2009
GentooBiscoe48.816.22226000male 2009
GentooBiscoe47.213.72144925female2009
GentooBiscoe46.814.32154850female2009
GentooBiscoe50.415.72225750male 2009
GentooBiscoe45.214.82125200female2009
GentooBiscoe49.916.12135400male 2009
A tibble: 50 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 40.618.61833550male 2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 40.518.91803950male 2007
AdelieDream 40.918.91843900male 2007
AdelieDream 42.218.51803550female2007
AdelieDream 40.818.41953900male 2007
AdelieDream 44.119.71964400male 2007
AdelieDream 41.119.01823425male 2007
AdelieDream 42.321.21914150male 2007
AdelieBiscoe 40.118.91884300male 2008
AdelieBiscoe 42.019.52004050male 2008
AdelieBiscoe 41.418.61913700male 2008
AdelieBiscoe 40.618.81933800male 2008
AdelieBiscoe 41.321.11954400male 2008
AdelieBiscoe 41.118.21924050male 2008
AdelieBiscoe 41.618.01923950male 2008
AdelieBiscoe 41.119.11884100male 2008
AdelieTorgersen41.819.41984450male 2008
AdelieTorgersen45.818.91974150male 2008
AdelieTorgersen42.818.51954250male 2008
AdelieTorgersen40.916.81913700female2008
AdelieTorgersen42.119.11954000male 2008
AdelieTorgersen42.917.61964700male 2008
AdelieDream 41.320.31943550male 2008
AdelieDream 41.118.12054300male 2008
AdelieDream 40.818.92084300male 2008
AdelieDream 40.318.51964350male 2008
AdelieDream 43.218.51924100male 2008
AdelieBiscoe 41.020.02034725male 2009
AdelieBiscoe 43.219.01974775male 2009
AdelieBiscoe 45.620.31914600male 2009
AdelieBiscoe 42.219.51974275male 2009
AdelieBiscoe 42.718.31964075male 2009
AdelieTorgersen41.118.61893325male 2009
AdelieTorgersen40.217.01763450female2009
AdelieTorgersen41.418.52023875male 2009
AdelieTorgersen40.619.01994000male 2009
AdelieTorgersen41.518.31954300male 2009
AdelieTorgersen44.118.02104000male 2009
AdelieTorgersen43.119.21973500male 2009
AdelieDream 41.117.51903900male 2009
AdelieDream 40.220.12003975male 2009
AdelieDream 40.217.11933400female2009
AdelieDream 40.617.21873475male 2009
AdelieDream 40.717.01903725male 2009
AdelieDream 41.518.52014000male 2009
penguins %>%
  filter(species == "Adelie", bill_length_mm == max(bill_length_mm))
A tibble: 0 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>

更多应用#

  • head()

  • tail()

  • slice() 取表格前几行,可与group_by()等其他函数连用

penguins %>% head()
penguins %>% tail()
A tibble: 6 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007
AdelieTorgersen38.917.81813625female2007
A tibble: 6 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
ChinstrapDream45.717.01953650female2009
ChinstrapDream55.819.82074000male 2009
ChinstrapDream43.518.12023400female2009
ChinstrapDream49.618.21933775male 2009
ChinstrapDream50.819.02104100male 2009
ChinstrapDream50.218.71983775female2009
penguins %>% slice(1)
A tibble: 1 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male2007
penguins %>% 
  group_by(species) %>% 
  slice(1)
A grouped_df: 3 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
Adelie Torgersen39.118.71813750male 2007
ChinstrapDream 46.517.91923500female2007
Gentoo Biscoe 46.113.22114500female2007
penguins %>% filter(bill_length_mm == max(bill_length_mm))

penguins %>% 
  arrange(desc(bill_length_mm)) %>% 
  slice(1)

penguins %>% 
  slice_max(bill_length_mm)
A tibble: 1 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
GentooBiscoe59.6172306050male2007
A tibble: 1 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
GentooBiscoe59.6172306050male2007
A tibble: 1 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
GentooBiscoe59.6172306050male2007

separate#

  • 按照某个分隔符分开某一列

tb <- tibble::tribble(
  ~day, ~price,
  1,   "30-45",
  2,   "40-95",
  3,   "89-65",
  4,   "45-63",
  5,   "52-42"
)
tb
tb1 <- tb %>% 
  separate(price, into=c("low", "high"), sep="-")
tb1
A tibble: 5 × 2
dayprice
<dbl><chr>
130-45
240-95
389-65
445-63
552-42
A tibble: 5 × 3
daylowhigh
<dbl><chr><chr>
13045
24095
38965
44563
55242

unite#

  • 将指定的列按照指定的连接符连接组成新的列

  • remove=FALSE,不移除被连接的都列

tb1 %>% unite(col="prics", c(low, high), sep=":", remove=FALSE)
tb1 %>% unite("prics", c(low, high), sep=":")
A tibble: 5 × 4
daypricslowhigh
<dbl><chr><chr><chr>
130:453045
240:954095
389:658965
445:634563
552:425242
A tibble: 5 × 2
dayprics
<dbl><chr>
130:45
240:95
389:65
445:63
552:42

distinct#

  • distinct()处理的对象是data.frame

  • 功能是筛选不重复的row;返回data.frame

  • n_distinct()处理的对象是vector;功能是统计不同的元素有多少个;返回一个数值

df <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, 1,
  1, 1, 2,
  1, 1, 1,
  2, 1, 2,
  1, 1, 3,
  3, 3, 1)
df
A tibble: 6 × 3
xyz
<dbl><dbl><dbl>
111
112
111
212
113
331
df %>% distinct()
df %>% distinct(x)
df %>% distinct(x, y)
df %>% distinct(x, y, .keep_all = TRUE) # 只保留最先出现的row
A tibble: 5 × 3
xyz
<dbl><dbl><dbl>
111
112
212
113
331
A tibble: 3 × 1
x
<dbl>
1
2
3
A tibble: 3 × 2
xy
<dbl><dbl>
11
21
33
A tibble: 3 × 3
xyz
<dbl><dbl><dbl>
111
212
331
df %>% distinct(
  across(c(x, y)),
  .keep_all = TRUE)

df %>%
  group_by(x) %>%
  distinct(y, .keep_all = TRUE)
A tibble: 3 × 3
xyz
<dbl><dbl><dbl>
111
212
331
A grouped_df: 3 × 3
xyz
<dbl><dbl><dbl>
111
212
331
df %>% group_by(x) %>% summarise(n = n_distinct(z))
A tibble: 3 × 2
xn
<dbl><int>
13
21
31

有关NA的计算#

  • NA很讨厌,凡是它参与的四则运算,结果都是NA

  • 所以需要事先把它删除,增加参数说明 na.rm = TRUE

sum(c(1, 2, NA, 4))
sum(c(1, 2, NA, 4), na.rm=TRUE)
<NA>
7
penguins %>% 
  mutate(
    body = if_else(body_mass_g > 4200, "you are fat", "you are fine"))
A tibble: 333 × 9
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyearbody
<fct><fct><dbl><dbl><int><int><fct><int><chr>
AdelieTorgersen39.118.71813750male 2007you are fine
AdelieTorgersen39.517.41863800female2007you are fine
AdelieTorgersen40.318.01953250female2007you are fine
AdelieTorgersen36.719.31933450female2007you are fine
AdelieTorgersen39.320.61903650male 2007you are fine
AdelieTorgersen38.917.81813625female2007you are fine
AdelieTorgersen39.219.61954675male 2007you are fat
AdelieTorgersen41.117.61823200female2007you are fine
AdelieTorgersen38.621.21913800male 2007you are fine
AdelieTorgersen34.621.11984400male 2007you are fat
AdelieTorgersen36.617.81853700female2007you are fine
AdelieTorgersen38.719.01953450female2007you are fine
AdelieTorgersen42.520.71974500male 2007you are fat
AdelieTorgersen34.418.41843325female2007you are fine
AdelieTorgersen46.021.51944200male 2007you are fine
AdelieBiscoe 37.818.31743400female2007you are fine
AdelieBiscoe 37.718.71803600male 2007you are fine
AdelieBiscoe 35.919.21893800female2007you are fine
AdelieBiscoe 38.218.11853950male 2007you are fine
AdelieBiscoe 38.817.21803800male 2007you are fine
AdelieBiscoe 35.318.91873800female2007you are fine
AdelieBiscoe 40.618.61833550male 2007you are fine
AdelieBiscoe 40.517.91873200female2007you are fine
AdelieBiscoe 37.918.61723150female2007you are fine
AdelieBiscoe 40.518.91803950male 2007you are fine
AdelieDream 39.516.71783250female2007you are fine
AdelieDream 37.218.11783900male 2007you are fine
AdelieDream 39.517.81883300female2007you are fine
AdelieDream 40.918.91843900male 2007you are fine
AdelieDream 36.417.01953325female2007you are fine
ChinstrapDream46.916.61922700female2008you are fine
ChinstrapDream53.519.92054500male 2008you are fat
ChinstrapDream49.019.52103950male 2008you are fine
ChinstrapDream46.217.51873650female2008you are fine
ChinstrapDream50.919.11963550male 2008you are fine
ChinstrapDream45.517.01963500female2008you are fine
ChinstrapDream50.917.91963675female2009you are fine
ChinstrapDream50.818.52014450male 2009you are fat
ChinstrapDream50.117.91903400female2009you are fine
ChinstrapDream49.019.62124300male 2009you are fat
ChinstrapDream51.518.71873250male 2009you are fine
ChinstrapDream49.817.31983675female2009you are fine
ChinstrapDream48.116.41993325female2009you are fine
ChinstrapDream51.419.02013950male 2009you are fine
ChinstrapDream45.717.31933600female2009you are fine
ChinstrapDream50.719.72034050male 2009you are fine
ChinstrapDream42.517.31873350female2009you are fine
ChinstrapDream52.218.81973450male 2009you are fine
ChinstrapDream45.216.61913250female2009you are fine
ChinstrapDream49.319.92034050male 2009you are fine
ChinstrapDream50.218.82023800male 2009you are fine
ChinstrapDream45.619.41943525female2009you are fine
ChinstrapDream51.919.52063950male 2009you are fine
ChinstrapDream46.816.51893650female2009you are fine
ChinstrapDream45.717.01953650female2009you are fine
ChinstrapDream55.819.82074000male 2009you are fine
ChinstrapDream43.518.12023400female2009you are fine
ChinstrapDream49.618.21933775male 2009you are fine
ChinstrapDream50.819.02104100male 2009you are fine
ChinstrapDream50.218.71983775female2009you are fine
df <- tibble::tribble(
    ~name,     ~type, ~score,
  "Alice", "english",    80,
  "Alice",    "math",    NA,
    "Bob", "english",    70,
    "Bob",    "math",    69,
  "Carol", "english",    NA,
  "Carol",    "math",    90
  )
df
A tibble: 6 × 3
nametypescore
<chr><chr><dbl>
Aliceenglish80
Alicemath NA
Bob english70
Bob math 69
CarolenglishNA
Carolmath 90
df %>% 
  group_by(type) %>% 
  mutate(mean_score = mean(score, na.rm=TRUE)) %>% 
  mutate(new_score = if_else(is.na(score), mean_score, score))
A grouped_df: 6 × 5
nametypescoremean_scorenew_score
<chr><chr><dbl><dbl><dbl>
Aliceenglish8075.080.0
Alicemath NA79.579.5
Bob english7075.070.0
Bob math 6979.569.0
CarolenglishNA75.075.0
Carolmath 9079.590.0
penguins %>% 
  mutate(
    body = case_when(
      body_mass_g < 3500   ~ "best",
      body_mass_g >= 3500 & body_mass_g < 4500  ~ "good",
      body_mass_g >= 4500 & body_mass_g < 5500 ~ "general",
      TRUE                                     ~ "other"))
A tibble: 333 × 9
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyearbody
<fct><fct><dbl><dbl><int><int><fct><int><chr>
AdelieTorgersen39.118.71813750male 2007good
AdelieTorgersen39.517.41863800female2007good
AdelieTorgersen40.318.01953250female2007best
AdelieTorgersen36.719.31933450female2007best
AdelieTorgersen39.320.61903650male 2007good
AdelieTorgersen38.917.81813625female2007good
AdelieTorgersen39.219.61954675male 2007general
AdelieTorgersen41.117.61823200female2007best
AdelieTorgersen38.621.21913800male 2007good
AdelieTorgersen34.621.11984400male 2007good
AdelieTorgersen36.617.81853700female2007good
AdelieTorgersen38.719.01953450female2007best
AdelieTorgersen42.520.71974500male 2007general
AdelieTorgersen34.418.41843325female2007best
AdelieTorgersen46.021.51944200male 2007good
AdelieBiscoe 37.818.31743400female2007best
AdelieBiscoe 37.718.71803600male 2007good
AdelieBiscoe 35.919.21893800female2007good
AdelieBiscoe 38.218.11853950male 2007good
AdelieBiscoe 38.817.21803800male 2007good
AdelieBiscoe 35.318.91873800female2007good
AdelieBiscoe 40.618.61833550male 2007good
AdelieBiscoe 40.517.91873200female2007best
AdelieBiscoe 37.918.61723150female2007best
AdelieBiscoe 40.518.91803950male 2007good
AdelieDream 39.516.71783250female2007best
AdelieDream 37.218.11783900male 2007good
AdelieDream 39.517.81883300female2007best
AdelieDream 40.918.91843900male 2007good
AdelieDream 36.417.01953325female2007best
ChinstrapDream46.916.61922700female2008best
ChinstrapDream53.519.92054500male 2008general
ChinstrapDream49.019.52103950male 2008good
ChinstrapDream46.217.51873650female2008good
ChinstrapDream50.919.11963550male 2008good
ChinstrapDream45.517.01963500female2008good
ChinstrapDream50.917.91963675female2009good
ChinstrapDream50.818.52014450male 2009good
ChinstrapDream50.117.91903400female2009best
ChinstrapDream49.019.62124300male 2009good
ChinstrapDream51.518.71873250male 2009best
ChinstrapDream49.817.31983675female2009good
ChinstrapDream48.116.41993325female2009best
ChinstrapDream51.419.02013950male 2009good
ChinstrapDream45.717.31933600female2009good
ChinstrapDream50.719.72034050male 2009good
ChinstrapDream42.517.31873350female2009best
ChinstrapDream52.218.81973450male 2009best
ChinstrapDream45.216.61913250female2009best
ChinstrapDream49.319.92034050male 2009good
ChinstrapDream50.218.82023800male 2009good
ChinstrapDream45.619.41943525female2009good
ChinstrapDream51.919.52063950male 2009good
ChinstrapDream46.816.51893650female2009good
ChinstrapDream45.717.01953650female2009good
ChinstrapDream55.819.82074000male 2009good
ChinstrapDream43.518.12023400female2009best
ChinstrapDream49.618.21933775male 2009good
ChinstrapDream50.819.02104100male 2009good
ChinstrapDream50.218.71983775female2009good
penguins %>% 
  mutate(degree = case_when(
                  bill_length_mm < 35                        ~ "A",
      bill_length_mm >= 35 & bill_length_mm < 45 ~ "B",
      bill_length_mm >= 45 & bill_length_mm < 55 ~ "C",
      TRUE                                       ~ "D"))
A tibble: 333 × 9
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyeardegree
<fct><fct><dbl><dbl><int><int><fct><int><chr>
AdelieTorgersen39.118.71813750male 2007B
AdelieTorgersen39.517.41863800female2007B
AdelieTorgersen40.318.01953250female2007B
AdelieTorgersen36.719.31933450female2007B
AdelieTorgersen39.320.61903650male 2007B
AdelieTorgersen38.917.81813625female2007B
AdelieTorgersen39.219.61954675male 2007B
AdelieTorgersen41.117.61823200female2007B
AdelieTorgersen38.621.21913800male 2007B
AdelieTorgersen34.621.11984400male 2007A
AdelieTorgersen36.617.81853700female2007B
AdelieTorgersen38.719.01953450female2007B
AdelieTorgersen42.520.71974500male 2007B
AdelieTorgersen34.418.41843325female2007A
AdelieTorgersen46.021.51944200male 2007C
AdelieBiscoe 37.818.31743400female2007B
AdelieBiscoe 37.718.71803600male 2007B
AdelieBiscoe 35.919.21893800female2007B
AdelieBiscoe 38.218.11853950male 2007B
AdelieBiscoe 38.817.21803800male 2007B
AdelieBiscoe 35.318.91873800female2007B
AdelieBiscoe 40.618.61833550male 2007B
AdelieBiscoe 40.517.91873200female2007B
AdelieBiscoe 37.918.61723150female2007B
AdelieBiscoe 40.518.91803950male 2007B
AdelieDream 39.516.71783250female2007B
AdelieDream 37.218.11783900male 2007B
AdelieDream 39.517.81883300female2007B
AdelieDream 40.918.91843900male 2007B
AdelieDream 36.417.01953325female2007B
ChinstrapDream46.916.61922700female2008C
ChinstrapDream53.519.92054500male 2008C
ChinstrapDream49.019.52103950male 2008C
ChinstrapDream46.217.51873650female2008C
ChinstrapDream50.919.11963550male 2008C
ChinstrapDream45.517.01963500female2008C
ChinstrapDream50.917.91963675female2009C
ChinstrapDream50.818.52014450male 2009C
ChinstrapDream50.117.91903400female2009C
ChinstrapDream49.019.62124300male 2009C
ChinstrapDream51.518.71873250male 2009C
ChinstrapDream49.817.31983675female2009C
ChinstrapDream48.116.41993325female2009C
ChinstrapDream51.419.02013950male 2009C
ChinstrapDream45.717.31933600female2009C
ChinstrapDream50.719.72034050male 2009C
ChinstrapDream42.517.31873350female2009B
ChinstrapDream52.218.81973450male 2009C
ChinstrapDream45.216.61913250female2009C
ChinstrapDream49.319.92034050male 2009C
ChinstrapDream50.218.82023800male 2009C
ChinstrapDream45.619.41943525female2009C
ChinstrapDream51.919.52063950male 2009C
ChinstrapDream46.816.51893650female2009C
ChinstrapDream45.717.01953650female2009C
ChinstrapDream55.819.82074000male 2009D
ChinstrapDream43.518.12023400female2009B
ChinstrapDream49.618.21933775male 2009C
ChinstrapDream50.819.02104100male 2009C
ChinstrapDream50.218.71983775female2009C

拓展函数#

  • n()函数,统计当前分组数据框的行数

  • 统计某个变量中各组出现的次数,可以使用count()函数, 可以统计不同组合出现的次数

  • 可以在count()里构建新变量,并利用这个新变量完成统计

penguins %>% summarise(n = n())
A tibble: 1 × 1
n
<int>
333
penguins %>% group_by(species) %>% summarise(n = n())
A tibble: 3 × 2
speciesn
<fct><int>
Adelie 146
Chinstrap 68
Gentoo 119
penguins %>% count(species)
penguins %>% count(sex, sort = TRUE)
penguins %>% count(island, species)
A tibble: 3 × 2
speciesn
<fct><int>
Adelie 146
Chinstrap 68
Gentoo 119
A tibble: 2 × 2
sexn
<fct><int>
male 168
female165
A tibble: 5 × 3
islandspeciesn
<fct><fct><int>
Biscoe Adelie 44
Biscoe Gentoo 119
Dream Adelie 55
Dream Chinstrap 68
TorgersenAdelie 47
penguins %>% filter(bill_length_mm > 40) %>% summarise(n = n())
penguins %>% count(longer_bill = bill_length_mm > 40)
A tibble: 1 × 1
n
<int>
237
A tibble: 2 × 2
longer_billn
<lgl><int>
FALSE 96
TRUE237

强制转换#

  • 矢量中的元素必须是相同的类型,但如果不一样呢,会发生什么? 这个时候R会强制转换成相同的类型。这就涉及数据类型的转换层级

    • character > numeric > logical

    • double > integer

c("foo", 1, TRUE) # 强制转换成了字符串类型
  1. 'foo'
  2. '1'
  3. 'TRUE'
penguins %>% 
  mutate(is_bigger40 = bill_length_mm > 40) %>% 
  summarise(n = n())

penguins %>% 
  filter(bill_length_mm > 40) %>% 
  summarise(n = n())
A tibble: 1 × 1
n
<int>
333
A tibble: 1 × 1
n
<int>
237

across()函数#

  • 更安全、更简练的写法

penguins %>% 
  summarise(
    length = mean(bill_length_mm, na.rm=TRUE))
A tibble: 1 × 1
length
<dbl>
43.99279
penguins %>% 
  summarise(
    length = mean(bill_length_mm, na.rm=TRUE),
    depth = mean(bill_depth_mm, na.rm=TRUE)
  )
A tibble: 1 × 2
lengthdepth
<dbl><dbl>
43.9927917.16486
penguins %>% 
  summarise(
    across(c(bill_depth_mm, bill_length_mm), mean, na.rm=TRUE)
  )
Warning message:
“There was 1 warning in `summarise()`.
 In argument: `across(c(bill_depth_mm, bill_length_mm), mean, na.rm = TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))”
A tibble: 1 × 2
bill_depth_mmbill_length_mm
<dbl><dbl>
17.1648643.99279
penguins %>% 
  summarise(
    across(ends_with("_mm"), mean, na.rm=TRUE))
A tibble: 1 × 3
bill_length_mmbill_depth_mmflipper_length_mm
<dbl><dbl><dbl>
43.9927917.16486200.967

数据中心化#

看数据的离散

penguins %>% 
  mutate(
    bill_length_mm = bill_length_mm - mean(bill_length_mm, na.rm=TRUE),
    bill_depth_mm = bill_depth_mm - mean(bill_depth_mm, na.rm=TRUE))
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen-4.892793 1.535135141813750male 2007
AdelieTorgersen-4.492793 0.235135141863800female2007
AdelieTorgersen-3.692793 0.835135141953250female2007
AdelieTorgersen-7.292793 2.135135141933450female2007
AdelieTorgersen-4.692793 3.435135141903650male 2007
AdelieTorgersen-5.092793 0.635135141813625female2007
AdelieTorgersen-4.792793 2.435135141954675male 2007
AdelieTorgersen-2.892793 0.435135141823200female2007
AdelieTorgersen-5.392793 4.035135141913800male 2007
AdelieTorgersen-9.392793 3.935135141984400male 2007
AdelieTorgersen-7.392793 0.635135141853700female2007
AdelieTorgersen-5.292793 1.835135141953450female2007
AdelieTorgersen-1.492793 3.535135141974500male 2007
AdelieTorgersen-9.592793 1.235135141843325female2007
AdelieTorgersen 2.007207 4.335135141944200male 2007
AdelieBiscoe -6.192793 1.135135141743400female2007
AdelieBiscoe -6.292793 1.535135141803600male 2007
AdelieBiscoe -8.092793 2.035135141893800female2007
AdelieBiscoe -5.792793 0.935135141853950male 2007
AdelieBiscoe -5.192793 0.035135141803800male 2007
AdelieBiscoe -8.692793 1.735135141873800female2007
AdelieBiscoe -3.392793 1.435135141833550male 2007
AdelieBiscoe -3.492793 0.735135141873200female2007
AdelieBiscoe -6.092793 1.435135141723150female2007
AdelieBiscoe -3.492793 1.735135141803950male 2007
AdelieDream -4.492793-0.464864861783250female2007
AdelieDream -6.792793 0.935135141783900male 2007
AdelieDream -4.492793 0.635135141883300female2007
AdelieDream -3.092793 1.735135141843900male 2007
AdelieDream -7.592793-0.164864861953325female2007
ChinstrapDream 2.9072072-0.56486491922700female2008
ChinstrapDream 9.5072072 2.73513512054500male 2008
ChinstrapDream 5.0072072 2.33513512103950male 2008
ChinstrapDream 2.2072072 0.33513511873650female2008
ChinstrapDream 6.9072072 1.93513511963550male 2008
ChinstrapDream 1.5072072-0.16486491963500female2008
ChinstrapDream 6.9072072 0.73513511963675female2009
ChinstrapDream 6.8072072 1.33513512014450male 2009
ChinstrapDream 6.1072072 0.73513511903400female2009
ChinstrapDream 5.0072072 2.43513512124300male 2009
ChinstrapDream 7.5072072 1.53513511873250male 2009
ChinstrapDream 5.8072072 0.13513511983675female2009
ChinstrapDream 4.1072072-0.76486491993325female2009
ChinstrapDream 7.4072072 1.83513512013950male 2009
ChinstrapDream 1.7072072 0.13513511933600female2009
ChinstrapDream 6.7072072 2.53513512034050male 2009
ChinstrapDream-1.4927928 0.13513511873350female2009
ChinstrapDream 8.2072072 1.63513511973450male 2009
ChinstrapDream 1.2072072-0.56486491913250female2009
ChinstrapDream 5.3072072 2.73513512034050male 2009
ChinstrapDream 6.2072072 1.63513512023800male 2009
ChinstrapDream 1.6072072 2.23513511943525female2009
ChinstrapDream 7.9072072 2.33513512063950male 2009
ChinstrapDream 2.8072072-0.66486491893650female2009
ChinstrapDream 1.7072072-0.16486491953650female2009
ChinstrapDream11.8072072 2.63513512074000male 2009
ChinstrapDream-0.4927928 0.93513512023400female2009
ChinstrapDream 5.6072072 1.03513511933775male 2009
ChinstrapDream 6.8072072 1.83513512104100male 2009
ChinstrapDream 6.2072072 1.53513511983775female2009
centralized <- function(x){
    x - mean(x, na.rm=TRUE)
}

penguins %>% 
  mutate(
    across(c(bill_length_mm, bill_depth_mm), centralized))
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen-4.892793 1.535135141813750male 2007
AdelieTorgersen-4.492793 0.235135141863800female2007
AdelieTorgersen-3.692793 0.835135141953250female2007
AdelieTorgersen-7.292793 2.135135141933450female2007
AdelieTorgersen-4.692793 3.435135141903650male 2007
AdelieTorgersen-5.092793 0.635135141813625female2007
AdelieTorgersen-4.792793 2.435135141954675male 2007
AdelieTorgersen-2.892793 0.435135141823200female2007
AdelieTorgersen-5.392793 4.035135141913800male 2007
AdelieTorgersen-9.392793 3.935135141984400male 2007
AdelieTorgersen-7.392793 0.635135141853700female2007
AdelieTorgersen-5.292793 1.835135141953450female2007
AdelieTorgersen-1.492793 3.535135141974500male 2007
AdelieTorgersen-9.592793 1.235135141843325female2007
AdelieTorgersen 2.007207 4.335135141944200male 2007
AdelieBiscoe -6.192793 1.135135141743400female2007
AdelieBiscoe -6.292793 1.535135141803600male 2007
AdelieBiscoe -8.092793 2.035135141893800female2007
AdelieBiscoe -5.792793 0.935135141853950male 2007
AdelieBiscoe -5.192793 0.035135141803800male 2007
AdelieBiscoe -8.692793 1.735135141873800female2007
AdelieBiscoe -3.392793 1.435135141833550male 2007
AdelieBiscoe -3.492793 0.735135141873200female2007
AdelieBiscoe -6.092793 1.435135141723150female2007
AdelieBiscoe -3.492793 1.735135141803950male 2007
AdelieDream -4.492793-0.464864861783250female2007
AdelieDream -6.792793 0.935135141783900male 2007
AdelieDream -4.492793 0.635135141883300female2007
AdelieDream -3.092793 1.735135141843900male 2007
AdelieDream -7.592793-0.164864861953325female2007
ChinstrapDream 2.9072072-0.56486491922700female2008
ChinstrapDream 9.5072072 2.73513512054500male 2008
ChinstrapDream 5.0072072 2.33513512103950male 2008
ChinstrapDream 2.2072072 0.33513511873650female2008
ChinstrapDream 6.9072072 1.93513511963550male 2008
ChinstrapDream 1.5072072-0.16486491963500female2008
ChinstrapDream 6.9072072 0.73513511963675female2009
ChinstrapDream 6.8072072 1.33513512014450male 2009
ChinstrapDream 6.1072072 0.73513511903400female2009
ChinstrapDream 5.0072072 2.43513512124300male 2009
ChinstrapDream 7.5072072 1.53513511873250male 2009
ChinstrapDream 5.8072072 0.13513511983675female2009
ChinstrapDream 4.1072072-0.76486491993325female2009
ChinstrapDream 7.4072072 1.83513512013950male 2009
ChinstrapDream 1.7072072 0.13513511933600female2009
ChinstrapDream 6.7072072 2.53513512034050male 2009
ChinstrapDream-1.4927928 0.13513511873350female2009
ChinstrapDream 8.2072072 1.63513511973450male 2009
ChinstrapDream 1.2072072-0.56486491913250female2009
ChinstrapDream 5.3072072 2.73513512034050male 2009
ChinstrapDream 6.2072072 1.63513512023800male 2009
ChinstrapDream 1.6072072 2.23513511943525female2009
ChinstrapDream 7.9072072 2.33513512063950male 2009
ChinstrapDream 2.8072072-0.66486491893650female2009
ChinstrapDream 1.7072072-0.16486491953650female2009
ChinstrapDream11.8072072 2.63513512074000male 2009
ChinstrapDream-0.4927928 0.93513512023400female2009
ChinstrapDream 5.6072072 1.03513511933775male 2009
ChinstrapDream 6.8072072 1.83513512104100male 2009
ChinstrapDream 6.2072072 1.53513511983775female2009

数据标准化#

  • (x - mean(x))/sd(x)

  • 那一列数据的每个值都减去平均值再除以标准差

std <- function(x){
    (x-mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}
penguins %>% 
  mutate(
    across(c(bill_length_mm, bill_depth_mm), std)
  )
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen-0.8946955 0.779558951813750male 2007
AdelieTorgersen-0.8215515 0.119404281863800female2007
AdelieTorgersen-0.6752636 0.424091051953250female2007
AdelieTorgersen-1.3335592 1.084245731933450female2007
AdelieTorgersen-0.8581235 1.744400401903650male 2007
AdelieTorgersen-0.9312674 0.322528791813625female2007
AdelieTorgersen-0.8764095 1.236589111954675male 2007
AdelieTorgersen-0.5289757 0.220966531823200female2007
AdelieTorgersen-0.9861254 2.049087181913800male 2007
AdelieTorgersen-1.7175649 1.998306051984400male 2007
AdelieTorgersen-1.3518452 0.322528791853700female2007
AdelieTorgersen-0.9678394 0.931902341953450female2007
AdelieTorgersen-0.2729719 1.795181531974500male 2007
AdelieTorgersen-1.7541369 0.627215571843325female2007
AdelieTorgersen 0.3670377 2.201430561944200male 2007
AdelieBiscoe -1.1324133 0.576434441743400female2007
AdelieBiscoe -1.1506993 0.779558951803600male 2007
AdelieBiscoe -1.4798471 1.033464601893800female2007
AdelieBiscoe -1.0592694 0.474872181853950male 2007
AdelieBiscoe -0.9495534 0.017842021803800male 2007
AdelieBiscoe -1.5895630 0.881121211873800female2007
AdelieBiscoe -0.6204057 0.728777821833550male 2007
AdelieBiscoe -0.6386916 0.373309921873200female2007
AdelieBiscoe -1.1141273 0.728777821723150female2007
AdelieBiscoe -0.6386916 0.881121211803950male 2007
AdelieDream -0.8215515-0.236063631783250female2007
AdelieDream -1.2421292 0.474872181783900male 2007
AdelieDream -0.8215515 0.322528791883300female2007
AdelieDream -0.5655477 0.881121211843900male 2007
AdelieDream -1.3884171-0.083720241953325female2007
ChinstrapDream 0.53161154-0.286844761922700female2008
ChinstrapDream 1.73848670 1.388932502054500male 2008
ChinstrapDream 0.91561728 1.185807982103950male 2008
ChinstrapDream 0.40360963 0.170185411873650female2008
ChinstrapDream 1.26305103 0.982683471963550male 2008
ChinstrapDream 0.27560772-0.083720241963500female2008
ChinstrapDream 1.26305103 0.373309921963675female2009
ChinstrapDream 1.24476505 0.677996692014450male 2009
ChinstrapDream 1.11676314 0.373309921903400female2009
ChinstrapDream 0.91561728 1.236589112124300male 2009
ChinstrapDream 1.37276696 0.779558951873250male 2009
ChinstrapDream 1.06190517 0.068623151983675female2009
ChinstrapDream 0.75104339-0.388407011993325female2009
ChinstrapDream 1.35448097 0.931902342013950male 2009
ChinstrapDream 0.31217969 0.068623151933600female2009
ChinstrapDream 1.22647906 1.287370242034050male 2009
ChinstrapDream-0.27297190 0.068623151873350female2009
ChinstrapDream 1.50076887 0.830340081973450male 2009
ChinstrapDream 0.22074976-0.286844761913250female2009
ChinstrapDream 0.97047524 1.388932502034050male 2009
ChinstrapDream 1.13504912 0.830340082023800male 2009
ChinstrapDream 0.29389371 1.135026861943525female2009
ChinstrapDream 1.44591091 1.185807982063950male 2009
ChinstrapDream 0.51332555-0.337625881893650female2009
ChinstrapDream 0.31217969-0.083720241953650female2009
ChinstrapDream 2.15906441 1.338151372074000male 2009
ChinstrapDream-0.09011203 0.474872182023400female2009
ChinstrapDream 1.02533320 0.525653311933775male 2009
ChinstrapDream 1.24476505 0.931902342104100male 2009
ChinstrapDream 1.13504912 0.779558951983775female2009
penguins %>% 
  mutate(
    across(c(bill_length_mm, bill_depth_mm), 
           ~(.x - mean(.x, na.rm=TRUE)) / sd(.x, na.rm=TRUE))
  )
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen-0.8946955 0.779558951813750male 2007
AdelieTorgersen-0.8215515 0.119404281863800female2007
AdelieTorgersen-0.6752636 0.424091051953250female2007
AdelieTorgersen-1.3335592 1.084245731933450female2007
AdelieTorgersen-0.8581235 1.744400401903650male 2007
AdelieTorgersen-0.9312674 0.322528791813625female2007
AdelieTorgersen-0.8764095 1.236589111954675male 2007
AdelieTorgersen-0.5289757 0.220966531823200female2007
AdelieTorgersen-0.9861254 2.049087181913800male 2007
AdelieTorgersen-1.7175649 1.998306051984400male 2007
AdelieTorgersen-1.3518452 0.322528791853700female2007
AdelieTorgersen-0.9678394 0.931902341953450female2007
AdelieTorgersen-0.2729719 1.795181531974500male 2007
AdelieTorgersen-1.7541369 0.627215571843325female2007
AdelieTorgersen 0.3670377 2.201430561944200male 2007
AdelieBiscoe -1.1324133 0.576434441743400female2007
AdelieBiscoe -1.1506993 0.779558951803600male 2007
AdelieBiscoe -1.4798471 1.033464601893800female2007
AdelieBiscoe -1.0592694 0.474872181853950male 2007
AdelieBiscoe -0.9495534 0.017842021803800male 2007
AdelieBiscoe -1.5895630 0.881121211873800female2007
AdelieBiscoe -0.6204057 0.728777821833550male 2007
AdelieBiscoe -0.6386916 0.373309921873200female2007
AdelieBiscoe -1.1141273 0.728777821723150female2007
AdelieBiscoe -0.6386916 0.881121211803950male 2007
AdelieDream -0.8215515-0.236063631783250female2007
AdelieDream -1.2421292 0.474872181783900male 2007
AdelieDream -0.8215515 0.322528791883300female2007
AdelieDream -0.5655477 0.881121211843900male 2007
AdelieDream -1.3884171-0.083720241953325female2007
ChinstrapDream 0.53161154-0.286844761922700female2008
ChinstrapDream 1.73848670 1.388932502054500male 2008
ChinstrapDream 0.91561728 1.185807982103950male 2008
ChinstrapDream 0.40360963 0.170185411873650female2008
ChinstrapDream 1.26305103 0.982683471963550male 2008
ChinstrapDream 0.27560772-0.083720241963500female2008
ChinstrapDream 1.26305103 0.373309921963675female2009
ChinstrapDream 1.24476505 0.677996692014450male 2009
ChinstrapDream 1.11676314 0.373309921903400female2009
ChinstrapDream 0.91561728 1.236589112124300male 2009
ChinstrapDream 1.37276696 0.779558951873250male 2009
ChinstrapDream 1.06190517 0.068623151983675female2009
ChinstrapDream 0.75104339-0.388407011993325female2009
ChinstrapDream 1.35448097 0.931902342013950male 2009
ChinstrapDream 0.31217969 0.068623151933600female2009
ChinstrapDream 1.22647906 1.287370242034050male 2009
ChinstrapDream-0.27297190 0.068623151873350female2009
ChinstrapDream 1.50076887 0.830340081973450male 2009
ChinstrapDream 0.22074976-0.286844761913250female2009
ChinstrapDream 0.97047524 1.388932502034050male 2009
ChinstrapDream 1.13504912 0.830340082023800male 2009
ChinstrapDream 0.29389371 1.135026861943525female2009
ChinstrapDream 1.44591091 1.185807982063950male 2009
ChinstrapDream 0.51332555-0.337625881893650female2009
ChinstrapDream 0.31217969-0.083720241953650female2009
ChinstrapDream 2.15906441 1.338151372074000male 2009
ChinstrapDream-0.09011203 0.474872182023400female2009
ChinstrapDream 1.02533320 0.525653311933775male 2009
ChinstrapDream 1.24476505 0.931902342104100male 2009
ChinstrapDream 1.13504912 0.779558951983775female2009

多列多个统计函数#

penguins %>% 
  group_by(species) %>% 
  summarise(
    across(ends_with("_mm"), list(mean = mean, sd = sd), na.rm=TRUE))
A tibble: 3 × 7
speciesbill_length_mm_meanbill_length_mm_sdbill_depth_mm_meanbill_depth_mm_sdflipper_length_mm_meanflipper_length_mm_sd
<fct><dbl><dbl><dbl><dbl><dbl><dbl>
Adelie 38.823972.66259718.347261.219338190.10276.521825
Chinstrap48.833823.33925618.420591.135395195.82357.131894
Gentoo 47.568073.10611614.996640.985998217.23536.585431
penguins %>% 
  group_by(sex) %>% 
  summarise(
    across(starts_with("bill_"), list(max = max, min = min), na.rm=TRUE))
A tibble: 2 × 5
sexbill_length_mm_maxbill_length_mm_minbill_depth_mm_maxbill_depth_mm_min
<fct><dbl><dbl><dbl><dbl>
female58.032.120.713.1
male 59.634.621.514.1

数据规整#

  • 1 宽表格变长表格gather() pivot_longer(), 推荐pivot_longer()

    • image.png

    • 参数cols,表示哪些列需要转换.

    • 参数names_to,表示cols选取的这些列的名字,构成了新的一列,这里需要取一个名字.

    • 参数values_to, 表示cols选取的这些列的值,构成了新的一列,这里也需要取一个名字.

    • 数据框总的信息量不会丢失

  • 2 长表格变宽表格spread() pivot_wider()

library(tidyverse)
plant_height <- data.frame(
  Day = 1:5,
  A = c(0.7, 1.0, 1.5, 1.8, 2.2),
  B = c(0.5, 0.7, 0.9, 1.3, 1.8),
  C = c(0.3, 0.6, 1.0, 1.2, 2.2),
  D = c(0.4, 0.7, 1.2, 1.5, 3.2)
)


plant_height
A data.frame: 5 × 5
DayABCD
<int><dbl><dbl><dbl><dbl>
10.70.50.30.4
21.00.70.60.7
31.50.91.01.2
41.81.31.21.5
52.21.82.23.2
melted <- plant_height %>% 
  pivot_longer(cols = A:D,   # -Day or c(A, B, C, D) or c("A", "B", "C", "D")
               names_to = "plant",
               values_to = "height")
melted

melted1 <- gather(plant_height, variable, value, A:D)
melted1
A tibble: 20 × 3
Dayplantheight
<int><chr><dbl>
1A0.7
1B0.5
1C0.3
1D0.4
2A1.0
2B0.7
2C0.6
2D0.7
3A1.5
3B0.9
3C1.0
3D1.2
4A1.8
4B1.3
4C1.2
4D1.5
5A2.2
5B1.8
5C2.2
5D3.2
A data.frame: 20 × 3
Dayvariablevalue
<int><chr><dbl>
1A0.7
2A1.0
3A1.5
4A1.8
5A2.2
1B0.5
2B0.7
3B0.9
4B1.3
5B1.8
1C0.3
2C0.6
3C1.0
4C1.2
5C2.2
1D0.4
2D0.7
3D1.2
4D1.5
5D3.2
melted %>% 
  ggplot(aes(x=Day, y=height, color=plant))+
  geom_line()
../_images/59d1503de4e4e54b5f719310e6a0b37dc225d1eca65773cabb839ddbe4fb041c.png
wide <- melted %>% 
  pivot_wider(
  names_from = "plant",
  values_from = "height"
)
wide
A tibble: 5 × 5
DayABCD
<int><dbl><dbl><dbl><dbl>
10.70.50.30.4
21.00.70.60.7
31.50.91.01.2
41.81.31.21.5
52.21.82.23.2
  • 列名转换成多个变量

    • pivot_longer()中的names_pattern = 指定表头属性

    • pivot_wider()中的names_glue = 指定转变后表头样式

plant_record <- data.frame(
         day = c(1L, 2L, 3L, 4L, 5L),
    A_height = c(1.1, 1.2, 1.3, 1.4, 1.5),
     A_width = c(2.1, 2.2, 2.3, 2.4, 2.5),
     A_depth = c(3.1, 3.2, 3.3, 3.4, 3.5),
    B_height = c(4.1, 4.2, 4.3, 4.4, 4.5),
     B_width = c(5.1, 5.2, 5.3, 5.4, 5.5),
     B_depth = c(6.1, 6.2, 6.3, 6.4, 6.5),
    C_height = c(7.1, 7.2, 7.3, 7.4, 7.5),
     C_width = c(8.1, 8.2, 8.3, 8.4, 8.5),
     C_depth = c(9.1, 9.2, 9.3, 9.4, 9.5)
)
plant_record
A data.frame: 5 × 10
dayA_heightA_widthA_depthB_heightB_widthB_depthC_heightC_widthC_depth
<int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
11.12.13.14.15.16.17.18.19.1
21.22.23.24.25.26.27.28.29.2
31.32.33.34.35.36.37.38.39.3
41.42.43.44.45.46.47.48.49.4
51.52.53.54.55.56.57.58.59.5

我们想原始数据框的列名,转换成多个变量,比如A,B,C成为物种(species)变量,(height, width, depth)成为parameter变量image.png

我们希望原始数据框的列名中,一部分进入变量,一部分保持原来的列名,比如,image-2.png

## pivot_longer()函数的实现列名转换成多个变量
plant_record %>% 
  tidyr::pivot_longer(cols = !day,
                      names_to = c("species", "parameter"),
                      names_pattern = "(.*)_(.*)",
                      values_to = "value")

## pivot_longer()函数实现部分列名转变成变量,部分保持原来的列名
plant_record_longer <- plant_record %>% 
  tidyr::pivot_longer(cols = !day,
                      names_to = c("species",".value"),
                      names_pattern = "(.*)_(.*)")
plant_record_longer
# 注意 .value 而不是value,说明这里不是单个列名,
# 而是匹配得到的多个值做列名
A tibble: 45 × 4
dayspeciesparametervalue
<int><chr><chr><dbl>
1Aheight1.1
1Awidth 2.1
1Adepth 3.1
1Bheight4.1
1Bwidth 5.1
1Bdepth 6.1
1Cheight7.1
1Cwidth 8.1
1Cdepth 9.1
2Aheight1.2
2Awidth 2.2
2Adepth 3.2
2Bheight4.2
2Bwidth 5.2
2Bdepth 6.2
2Cheight7.2
2Cwidth 8.2
2Cdepth 9.2
3Aheight1.3
3Awidth 2.3
3Adepth 3.3
3Bheight4.3
3Bwidth 5.3
3Bdepth 6.3
3Cheight7.3
3Cwidth 8.3
3Cdepth 9.3
4Aheight1.4
4Awidth 2.4
4Adepth 3.4
4Bheight4.4
4Bwidth 5.4
4Bdepth 6.4
4Cheight7.4
4Cwidth 8.4
4Cdepth 9.4
5Aheight1.5
5Awidth 2.5
5Adepth 3.5
5Bheight4.5
5Bwidth 5.5
5Bdepth 6.5
5Cheight7.5
5Cwidth 8.5
5Cdepth 9.5
A tibble: 15 × 5
dayspeciesheightwidthdepth
<int><chr><dbl><dbl><dbl>
1A1.12.13.1
1B4.15.16.1
1C7.18.19.1
2A1.22.23.2
2B4.25.26.2
2C7.28.29.2
3A1.32.33.3
3B4.35.36.3
3C7.38.39.3
4A1.42.43.4
4B4.45.46.4
4C7.48.49.4
5A1.52.53.5
5B4.55.56.5
5C7.58.59.5
  • 变回去

    • image.png

us_rent_income %>%
    pivot_wider(
      names_from = variable,
      names_glue = "{variable}_{.value}",
      values_from = c(estimate, moe))
A tibble: 52 × 6
GEOIDNAMEincome_estimaterent_estimateincome_moerent_moe
<chr><chr><dbl><dbl><dbl><dbl>
01Alabama 24476 747136 3
02Alaska 32940120050813
04Arizona 27517 972148 4
05Arkansas 23789 709165 5
06California 294541358109 3
08Colorado 324011125109 5
09Connecticut 353261123195 5
10Delaware 31560107624710
11District of Columbia43198142468117
12Florida 259521077 70 3
13Georgia 27024 927106 3
15Hawaii 32453150721818
16Idaho 25298 792208 7
17Illinois 30684 952 83 3
18Indiana 27247 782117 3
19Iowa 30002 740143 4
20Kansas 29126 801208 5
21Kentucky 24702 713159 4
22Louisiana 25086 825155 4
23Maine 26841 808187 7
24Maryland 371471311152 5
25Massachusetts 344981173199 5
26Michigan 26987 824 82 3
27Minnesota 32734 906189 4
28Mississippi 22766 740194 5
29Missouri 26999 784113 4
30Montana 26249 751206 9
31Nebraska 30020 773146 4
32Nevada 290191017213 6
33New Hampshire 331721052387 9
34New Jersey 350751249148 4
35New Mexico 24457 809214 6
36New York 310571194 69 3
37North Carolina 26482 844111 3
38North Dakota 32336 775245 9
39Ohio 27435 764 94 2
40Oklahoma 26207 766101 3
41Oregon 27389 988146 4
42Pennsylvania 28923 885119 3
44Rhode Island 30210 957259 6
45South Carolina 25454 836123 4
46South Dakota 28821 696276 7
47Tennessee 25453 808102 4
48Texas 28063 952110 2
49Utah 27928 948239 6
50Vermont 29351 94536111
51Virginia 325451166202 5
53Washington 323181120113 4
54West Virginia 23707 681203 6
55Wisconsin 29868 813135 3
56Wyoming 30854 82834211
72Puerto Rico NA 464 NA 6
  • 注意 .value 而不是value,说明这里不是单个列名,而是匹配得到的多个值做列名

plant_record_longer %>% 
  pivot_wider(names_from = species,
              values_from = c(height, width, depth),
              names_glue = "{species}_{.value}")
A tibble: 5 × 10
dayA_heightB_heightC_heightA_widthB_widthC_widthA_depthB_depthC_depth
<int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
11.14.17.12.15.18.13.16.19.1
21.24.27.22.25.28.23.26.29.2
31.34.37.32.35.38.33.36.39.3
41.44.47.42.45.48.43.46.49.4
51.54.57.52.55.58.53.56.59.5

tidy data原则#

  • Hadley Wickhamt提出了数据科学tidy原则image.png

    • 一切都是数据框,任何数据都可以规整

    • 数据框的一列代表一个变量,数据框的一行代表一次观察

    • 函数处理数据时,数据框进数据框出(函数的第一个参数始终为数据框

根据Hadley Wickham的思想,这里的宽表格(plant_heigtwide)不是tidy的,只有长表格(long)才是tidy的,

melted
A tibble: 20 × 3
Dayplantheight
<int><chr><dbl>
1A0.7
1B0.5
1C0.3
1D0.4
2A1.0
2B0.7
2C0.6
2D0.7
3A1.5
3B0.9
3C1.0
3D1.2
4A1.8
4B1.3
4C1.2
4D1.5
5A2.2
5B1.8
5C2.2
5D3.2

fill()缺失值填充#

  • 根据最近进行补齐

  • 可以控制填充的方向.direction = "up"

sales <- tibble::tribble(
  ~quarter, ~year, ~sales,
      "Q1",  2000,  66013,
      "Q2",    NA,  69182,
      "Q3",    NA,  53175,
      "Q4",    NA,  21001,
      "Q1",  2001,  46036,
      "Q2",    NA,  58842,
      "Q3",    NA,  44568,
      "Q4",    NA,  50197,
      "Q1",  2002,  39113,
      "Q2",    NA,  41668,
      "Q3",    NA,  30144,
      "Q4",    NA,  52897
  )
sales
A tibble: 12 × 3
quarteryearsales
<chr><dbl><dbl>
Q1200066013
Q2 NA69182
Q3 NA53175
Q4 NA21001
Q1200146036
Q2 NA58842
Q3 NA44568
Q4 NA50197
Q1200239113
Q2 NA41668
Q3 NA30144
Q4 NA52897
sales %>% fill(year)
sales %>% fill(year, .direction="up") 
A tibble: 12 × 3
quarteryearsales
<chr><dbl><dbl>
Q1200066013
Q2200069182
Q3200053175
Q4200021001
Q1200146036
Q2200158842
Q3200144568
Q4200150197
Q1200239113
Q2200241668
Q3200230144
Q4200252897
A tibble: 12 × 3
quarteryearsales
<chr><dbl><dbl>
Q1200066013
Q2200169182
Q3200153175
Q4200121001
Q1200146036
Q2200258842
Q3200244568
Q4200250197
Q1200239113
Q2 NA41668
Q3 NA30144
Q4 NA52897

expand() complete()#

  • expand()指定数据框的若干列,根据其向量元素值,产生所有可能的交叉组合

    • nesting()用于限定只产生数据框已出现的组合

  • complete()补全,可以看作是expand(nesting()) + fill()

    • 数据在complete补全的时候,会面临有两种缺失值:

      • 1 补位的时候造成的空缺

      • 2 数据原先就存在缺失值

      • 3 补位的时候造成的空缺,可通过fill = list(x = 0)控制填充

      • 4 数据原先就存在缺失值,最好通过 explicit = FALSE显式地控制是否填充

df <- tibble::tribble(
  ~x, ~y, ~z,
  1L, 1L, 4L,
  1L, 2L, 5L,
  2L, 1L, NA,
  3L, 2L, 6L
)
df
df %>% expand(x,y)
df %>% expand(nesting(x,y))
df %>% expand(nesting(x,y), z)
A tibble: 4 × 3
xyz
<int><int><int>
11 4
12 5
21NA
32 6
A tibble: 6 × 2
xy
<int><int>
11
12
21
22
31
32
A tibble: 4 × 2
xy
<int><int>
11
12
21
32
A tibble: 16 × 3
xyz
<int><int><int>
11 4
11 5
11 6
11NA
12 4
12 5
12 6
12NA
21 4
21 5
21 6
21NA
32 4
32 5
32 6
32NA
df %>% complete(x, y)
df %>% complete(x, y, fill=list(z=0))
df %>% complete(x, y, fill = list(z = 0), explicit = FALSE)
A tibble: 6 × 3
xyz
<int><int><int>
11 4
12 5
21NA
22NA
31NA
32 6
A tibble: 6 × 3
xyz
<int><int><int>
114
125
210
220
310
326
A tibble: 6 × 3
xyz
<int><int><int>
11 4
12 5
21NA
22 0
31 0
32 6

expand_grid() crossing()#

  • expand_grid()产生一个新的数据框,每行对应着向量元素的所有交叉组合

  • 向量换成数据框也可以,其结果就是数据框行与元素的交叉组合

  • crossing()可以看作是expand_grid() + distinct(), 即crossing()在完成交叉组合之后会自动去重

expand_grid(x=c(1,1,2), y= 1:2)
crossing(x=c(1,1,2), y=1:2)
A tibble: 6 × 2
xy
<dbl><int>
11
12
11
12
21
22
A tibble: 4 × 2
xy
<dbl><int>
11
12
21
22

separate()unite()#

  • 以提供的分隔符分隔或合并某些列

  • 有时候分隔符搞不定的,可以用正则表达式,将捕获的每组弄成一列

tb <- tibble::tribble(
  ~day, ~price,
  1,   "30-45",
  2,   "40-95",
  3,   "89-65",
  4,   "45-63",
  5,   "52-42"
)
tb %>% 
  separate(price, into=c("low", "high"), sep="-") %>% 
  unite(col="price", c(low, high), sep=":", remove=FALSE)
A tibble: 5 × 4
daypricelowhigh
<dbl><chr><chr><chr>
130:453045
240:954095
389:658965
445:634563
552:425242
## 有时候分隔符搞不定的,可以用正则表达式,将捕获的每组弄成一列
dfc <- tibble(x = c("1-12week", "1-10wk", "5-12w", "01-05weeks"))
dfc

dfc %>% tidyr::extract(
  x, 
  c("start", "end", "letter"), "(\\d+)-(\\d+)([a-z]+)",
  remove = FALSE
)
A tibble: 4 × 1
x
<chr>
1-12week
1-10wk
5-12w
01-05weeks
A tibble: 4 × 4
xstartendletter
<chr><chr><chr><chr>
1-12week 1 12week
1-10wk 1 10wk
5-12w 5 12w
01-05weeks0105weeks

删除缺失值所在行drop_na()replace_na()#

  • drop_na()删除有缺失值的所有row,等效于filter(across(everything(), ~!is.na(.x))),可单独使用 df %>% drop_na()删除df中的缺失值

  • 也可指定某列,只删除该列存在缺失值的行

  • replace_na() coalesce()替换缺失值 x = replace_na(x, 0)将x列的缺失值替换为0, 可在mutate()等函数中使用

df <- tibble::tribble(
    ~name,     ~type, ~score, ~extra,
  "Alice", "english",     80,     10,
  "Alice",    "math",     NA,      5,
    "Bob", "english",     NA,      9,
    "Bob",    "math",     69,     NA,
  "Carol", "english",     80,     10,
  "Carol",    "math",     90,      5
  )

# 如果score列中有缺失值NA,就删除所在的row
df %>% 
  filter(!is.na(score))

df %>% 
  filter(
    across(score, ~ !is.na(.x))
  )
A tibble: 4 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Bob math 69NA
Carolenglish8010
Carolmath 90 5
Warning message:
“Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
 Please use `if_any()` or `if_all()` instead.”
A tibble: 4 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Bob math 69NA
Carolenglish8010
Carolmath 90 5
# 所有列,如果有缺失值NA,就删除所在的row
df %>% 
  filter(
    across(everything(), ~!is.na(.x))
  )

# 更简便的方法
df %>% 
  drop_na()
# 也可指定某咧
df %>% 
  drop_na(score)
Warning message:
“Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
 Please use `if_any()` or `if_all()` instead.”
A tibble: 3 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Carolenglish8010
Carolmath 90 5
A tibble: 3 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Carolenglish8010
Carolmath 90 5
A tibble: 4 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Bob math 69NA
Carolenglish8010
Carolmath 90 5
# 将score替换为0分
df
df %>% 
  mutate(score = replace_na(score, 0))
df %>% 
  mutate(score = coalesce(score, 0))
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Alicemath NA 5
Bob englishNA 9
Bob math 69NA
Carolenglish8010
Carolmath 90 5
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Alicemath 0 5
Bob english 0 9
Bob math 69NA
Carolenglish8010
Carolmath 90 5
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Alicemath 0 5
Bob english 0 9
Bob math 69NA
Carolenglish8010
Carolmath 90 5
df %>% 
  mutate(
    across(c(score, extra), ~ replace_na(.x,0))
  )
df %>% 
  mutate(
    across(c(score, extra), ~ coalesce(.x,0)))
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Alicemath 0 5
Bob english 0 9
Bob math 69 0
Carolenglish8010
Carolmath 90 5
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish8010
Alicemath 0 5
Bob english 0 9
Bob math 69 0
Carolenglish8010
Carolmath 90 5
df %>% 
  mutate(
    score = replace_na(score, mean(score, na.rm=TRUE))
  )

## 也可用if_else()来做
df %>% 
  mutate(
    score = if_else(is.na(score), mean(score, na.rm=TRUE), score)
  )
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish80.0010
Alicemath 79.75 5
Bob english79.75 9
Bob math 69.00NA
Carolenglish80.0010
Carolmath 90.00 5
A tibble: 6 × 4
nametypescoreextra
<chr><chr><dbl><dbl>
Aliceenglish80.0010
Alicemath 79.75 5
Bob english79.75 9
Bob math 69.00NA
Carolenglish80.0010
Carolmath 90.00 5