There are some problems we face often when dealing with data.
The issue: Many fundamental data processing functions exist in Base R and beyond. Sometimes they can be inconsistent or unnecessarily complex. Especially when dealing with non-standard dataframes. The result is code that is confusing and doesn’t flow i.e. nested functions
Tidyverse is most importantly a philosophy for data analysis that more often then not makes wrangling data easier. The tidyverse community have built what they describe as an opinionated group of packages. These packages readily talk to one another.
You can read their manifesto to get a better understanding of the tidy ethos.
Other tools have now been made by the Tidy community. This community also overlaps with Bioconductor. But the packages above are the linchpins that hold it together.
We will touch on the core utilities of most of these packages, and show some of their most useful functions.
We already covered plotting with ggplot.
We also will not be going into the purr or forcats packages. For a more extensive look into tidyverse, we have a full day workshop online.
## # A tibble: 6 x 5
## salmon_id common_name age_classbylength length_mm IGF1_ng_ml
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 35032 Chinook salmon yearling 147 41.3
## 2 35035 Sockeye salmon juvenile 121 NA
## 3 35036 Sockeye salmon juvenile 112 NA
## 4 35037 Steelhead juvenile 220 42.7
## 5 35038 Steelhead juvenile 152 NA
## 6 35033 Chinook salmon mixed age juvenile 444 62.1
## # A tibble: 6 x 5
## salmon_id common_name age_classbylength variable value
## <dbl> <chr> <chr> <fct> <dbl>
## 1 35032 Chinook salmon yearling length_mm 147
## 2 35032 Chinook salmon yearling IGF1_ng_ml 41.3
## 3 35033 Chinook salmon mixed age juvenile length_mm 444
## 4 35033 Chinook salmon mixed age juvenile IGF1_ng_ml 62.1
## 5 35034 Sockeye salmon juvenile length_mm 139
## 6 35034 Sockeye salmon juvenile IGF1_ng_ml NA
## # A tibble: 6 x 5
## salmon_id common_name age_classbylength variable value
## <dbl> <chr> <chr> <fct> <dbl>
## 1 35032 Chinook salmon yearling length_mm 147
## 2 35033 Chinook salmon mixed age juvenile length_mm 444
## 3 35034 Sockeye salmon juvenile length_mm 139
## 4 35035 Sockeye salmon juvenile length_mm 121
## 5 35036 Sockeye salmon juvenile length_mm 112
## 6 35037 Steelhead juvenile length_mm 220
## # A tibble: 6 x 5
## salmon_id common_name age_classbylength variable value
## <dbl> <chr> <chr> <fct> <dbl>
## 1 35032 Chinook salmon yearling IGF1_ng_ml 41.3
## 2 35033 Chinook salmon mixed age juvenile IGF1_ng_ml 62.1
## 3 35034 Sockeye salmon juvenile IGF1_ng_ml NA
## 4 35035 Sockeye salmon juvenile IGF1_ng_ml NA
## 5 35036 Sockeye salmon juvenile IGF1_ng_ml NA
## 6 35037 Steelhead juvenile IGF1_ng_ml 42.7
A tidy dataset is a data frame (or table) for which the following are true:
Our first dataframe is tidy
Consistent dataframe layouts help to ensure that all values are present and that relationships between data points are clear.
R is a vectorized programming language. R builds data frames from vectors, and R works best when its operation are vectorized.
Tidy data utilizes both these aspects of R.
=> Precise and Fast
Lets start from the beginning and tidy some data. First step is to read in data.
readr:
When you read data in with base, it goes into a regular dataframe. When you return the dataframe through typing it in the console it will just print the whole dataframe.
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## 11 361 0 1 0 0
## 12 362 3 1 15 0
## 13 363 14 6 4 1
## 14 364 7 0 1 0
## 15 366 6 0 1 0
## 16 367 42 0 51 1
## 17 368 28 0 24 0
## 18 369 1204 1034 833 478
## 19 372 2829 1864 2741 771
## 20 373 179 728 148 795
## 21 374 76 5 138 2
## 22 375 4428 6697 4970 4328
## 23 377 3170 314 2576 11
## 24 378 1839 4845 1767 2975
## 25 379 178 1 181 0
## 26 381 1617 574 1159 339
## 27 382 2874 2265 1746 1668
## 28 383 63 1632 40 721
## 29 384 148 10977 118 94
## 30 387 8899 2457 7405 1228
## 31 388 12598 171 5090 70
## 32 389 2709 193 2313 5
## 33 390 1004 0 395 0
## 34 391 1038 577 1176 164
## 35 392 1527 304 786 71
## 36 393 2949 138 1540 3
## 37 394 1525 464 1062 134
## 38 395 348 67 123 0
## 39 396 6503 702 4723 169
## 40 397 12997 410 11265 38
## 41 398 0 0 0 0
## 42 399 223 11 422 2
## 43 400 1188 147 806 56
## 44 401 0 0 0 0
## 45 402 504 218 496 80
## 46 403 289 25 166 4
## 47 405 1481 824 1004 812
## 48 406 295 175 87 35
## 49 407 4 1 2 1
## 50 408 2451 111 1523 6
## 51 409 2480 1819 1356 226
## 52 410 433 197 215 77
## 53 411 829 217 441 131
## 54 412 312 45 138 17
## 55 414 516 15 396 9
## 56 415 20 0 13 0
## 57 416 2 1 4 0
## 58 417 0 0 0 0
## 59 419 6 5 2 7
## 60 420 141 1136 94 1217
## 61 421 213 255 93 208
## 62 427 4699 11889 1729 926
## 63 429 0 0 2 0
## 64 430 34 13 22 0
## 65 432 63 0 55 1
## 66 433 38 0 26 0
## 67 434 1 1 2 0
## 68 435 408 151 284 34
## 69 440 157 2520 111 535
## 70 443 5 0 4 0
## 71 444 1583 151 747 14
## 72 445 116 15 90 1
## 73 460 34 0 68 0
## 74 462 70 0 31 1
## 75 463 1244 118 492 71
## 76 466 538 480 393 218
## 77 467 2506 402 2130 18
## 78 468 7991 6132 5307 1883
## 79 471 1272 771 1392 53
## 80 472 1389 628 739 138
## 81 473 4173 783 1901 776
## 82 474 0 0 0 0
## 83 475 284 83 467 34
## 84 476 4952 3453 4202 1416
## 85 477 13 3 26 2
## 86 478 78 121 67 0
## 87 479 17 0 4 0
## 88 480 9 5 11 1
## 89 481 1937 18 1017 34
## 90 482 157 1392 75 1660
## 91 483 1075 1454 1789 1141
## 92 486 47 0 18 0
## 93 487 29 33 19 3
## 94 488 4529 1118 2925 269
## 95 489 3465 153 3188 8
## 96 490 1610 1263 913 665
## 97 491 12 1 4 0
## 98 492 4 0 6 0
## 99 493 5011 3585 3053 743
## 100 495 0 0 0 0
When you use readr to read in your data, you instead get a special dataframe called a tibble. Tibbles have several properties that make them more user friendly i.e. When you return the tibble through typing it in the console it will print a preview, with some information about dimensions and data types.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## ENTREZ = col_double(),
## CD34_1 = col_double(),
## ORTHO_1 = col_double(),
## CD34_2 = col_double(),
## ORTHO_2 = col_double()
## )
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
Tibbles carry and display extra information. While reading in it is easy to specify data type.
untidy_counts <- read_csv("data/hemato_rnaseq_counts.csv", col_types = cols(
ENTREZ = col_character(),
CD34_1 = col_integer(),
ORTHO_1 = col_integer(),
CD34_2 = col_integer(),
ORTHO_2 = col_integer()
))
untidy_counts
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
You can use the same methods as base to interact with tibbles to subset them.
## # A tibble: 1 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## # A tibble: 100 x 1
## ENTREZ
## <chr>
## 1 350
## 2 351
## 3 353
## 4 354
## 5 355
## 6 356
## 7 357
## 8 358
## 9 359
## 10 360
## # … with 90 more rows
Tibbles are column-oriented, so if you don’t specify row/column with a comma, it will assume you want a column.
## # A tibble: 100 x 1
## ENTREZ
## <chr>
## 1 350
## 2 351
## 3 353
## 4 354
## 5 355
## 6 356
## 7 357
## 8 358
## 9 359
## 10 360
## # … with 90 more rows
The outputs thus far have produced more tibbles. Sometimes we just want a vector. Like lists, we can use double brackets.
## [1] "350" "351" "353" "354" "355" "356" "357" "358" "359" "360" "361" "362"
## [13] "363" "364" "366" "367" "368" "369" "372" "373" "374" "375" "377" "378"
## [25] "379" "381" "382" "383" "384" "387" "388" "389" "390" "391" "392" "393"
## [37] "394" "395" "396" "397" "398" "399" "400" "401" "402" "403" "405" "406"
## [49] "407" "408" "409" "410" "411" "412" "414" "415" "416" "417" "419" "420"
## [61] "421" "427" "429" "430" "432" "433" "434" "435" "440" "443" "444" "445"
## [73] "460" "462" "463" "466" "467" "468" "471" "472" "473" "474" "475" "476"
## [85] "477" "478" "479" "480" "481" "482" "483" "486" "487" "488" "489" "490"
## [97] "491" "492" "493" "495"
Also, like lists we can use a dollar sign with the column name.
## [1] "350" "351" "353" "354" "355" "356" "357" "358" "359" "360" "361" "362"
## [13] "363" "364" "366" "367" "368" "369" "372" "373" "374" "375" "377" "378"
## [25] "379" "381" "382" "383" "384" "387" "388" "389" "390" "391" "392" "393"
## [37] "394" "395" "396" "397" "398" "399" "400" "401" "402" "403" "405" "406"
## [49] "407" "408" "409" "410" "411" "412" "414" "415" "416" "417" "419" "420"
## [61] "421" "427" "429" "430" "432" "433" "434" "435" "440" "443" "444" "445"
## [73] "460" "462" "463" "466" "467" "468" "471" "472" "473" "474" "475" "476"
## [85] "477" "478" "479" "480" "481" "482" "483" "486" "487" "488" "489" "490"
## [97] "491" "492" "493" "495"
There are coercion functions similar to base to convert a regular dataframe to a tibble.
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <int> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
Once it is a tibble it is straight forward to modify the datatype using the dplyr function mutate_at.
untidy_counts_base <- as_tibble(untidy_counts_base)
untidy_counts_base <- mutate_at(untidy_counts_base, vars(ENTREZ), as.character)
untidy_counts_base
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
Some tools are not tibble friendly. Calling as.data.frame is sufficient to convert it back to a base data frame
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## 11 361 0 1 0 0
## 12 362 3 1 15 0
We will make our own tibble now from scratch, using some metadata. We are accessing some Bioconductor databases that contain annotation information for genes. First we grab the position of genes. Then we use the IDs to get gene symbols. We will cover how to do this in more detail later on here
# Lets load in some packages
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
hg19_genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
keys <- hg19_genes$gene_id
symbols <- AnnotationDbi::select(org.Hs.eg.db, keys = keys, columns = c("SYMBOL"), keytype = "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
Now we will put all the metadata together into a tibble
counts_metadata <- tibble(ID = symbols$ENTREZID, SYMBOL = symbols$SYMBOL, LENGTH = lengths(hg19_genes))
counts_metadata
## # A tibble: 23,056 x 3
## ID SYMBOL LENGTH
## <chr> <chr> <int>
## 1 1 A1BG 16043
## 2 10 NAT2 9969
## 3 100 ADA 32214
## 4 1000 CDH2 226516
## 5 10000 AKT3 355352
## 6 100008586 GAGE12F 15729
## 7 100009676 ZBTB11-AS1 2784
## 8 10001 MED6 16428
## 9 10002 NR2E3 7704
## 10 10003 NAALAD2 57962
## # … with 23,046 more rows
What is wrong with the count dataframe from a tidy viewpoint?
Remember. These are the rules:
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
What is wrong with the count dataframe from a tidy viewpoint?
Remember, these are the rules:
A single variable with multiple columns
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
When all the content is in the tibble, but it is in the wrong orientation, tidyr has some tools to move the data around quickly and easily:
pivot_longer and pivot_wider
pivot_longer allows you to collapse single variables that are spread over multiple columns. In this case new columns Sample and Counts are created. Sample is made from the column names hence names_to. While Counts is made from the values within the columns hence values_to. We are also specifying that we are tidying all columns apart from ENTREZ with the cols argument.
This recently replaced a function called gather
tidier_counts <- pivot_longer(untidy_counts, names_to = "Sample", values_to = "counts", cols = c(-ENTREZ))
tidier_counts
## # A tibble: 400 x 3
## ENTREZ Sample counts
## <chr> <chr> <int>
## 1 350 CD34_1 204
## 2 350 ORTHO_1 0
## 3 350 CD34_2 103
## 4 350 ORTHO_2 0
## 5 351 CD34_1 15586
## 6 351 ORTHO_1 479
## 7 351 CD34_2 10476
## 8 351 ORTHO_2 39
## 9 353 CD34_1 842
## 10 353 ORTHO_1 355
## # … with 390 more rows
pivot_wider allows you to go in the opposite direction to pivot_longer. We can spread single variables over multiple columns.
This recently replaced a function called spread
## # A tibble: 100 x 5
## ENTREZ CD34_1 ORTHO_1 CD34_2 ORTHO_2
## <chr> <int> <int> <int> <int>
## 1 350 204 0 103 0
## 2 351 15586 479 10476 39
## 3 353 842 355 1188 86
## 4 354 0 0 0 0
## 5 355 123 291 139 16
## 6 356 1 1 0 0
## 7 357 380 3 177 0
## 8 358 572 2225 597 4051
## 9 359 0 12 1 0
## 10 360 320 502 46 1114
## # … with 90 more rows
Remember, these are the rules:
## # A tibble: 400 x 3
## ENTREZ Sample counts
## <chr> <chr> <int>
## 1 350 CD34_1 204
## 2 350 ORTHO_1 0
## 3 350 CD34_2 103
## 4 350 ORTHO_2 0
## 5 351 CD34_1 15586
## 6 351 ORTHO_1 479
## 7 351 CD34_2 10476
## 8 351 ORTHO_2 39
## 9 353 CD34_1 842
## 10 353 ORTHO_1 355
## # … with 390 more rows
Remember, these are the rules:
Multiple variables in a single column
## # A tibble: 400 x 3
## ENTREZ Sample counts
## <chr> <chr> <int>
## 1 350 CD34_1 204
## 2 350 ORTHO_1 0
## 3 350 CD34_2 103
## 4 350 ORTHO_2 0
## 5 351 CD34_1 15586
## 6 351 ORTHO_1 479
## 7 351 CD34_2 10476
## 8 351 ORTHO_2 39
## 9 353 CD34_1 842
## 10 353 ORTHO_1 355
## # … with 390 more rows
When there are several variables crammed into a single column, tidyr can be used to split single values into several:
separate
Separate allows you to break a strings in a variable by a separator. In this case the cell type and replicate number are broken by underscore.
tidy_counts <- separate(tidier_counts, Sample, sep = "_", into=c("CellType", "Rep"), remove=TRUE)
tidy_counts
## # A tibble: 400 x 4
## ENTREZ CellType Rep counts
## <chr> <chr> <chr> <int>
## 1 350 CD34 1 204
## 2 350 ORTHO 1 0
## 3 350 CD34 2 103
## 4 350 ORTHO 2 0
## 5 351 CD34 1 15586
## 6 351 ORTHO 1 479
## 7 351 CD34 2 10476
## 8 351 ORTHO 2 39
## 9 353 CD34 1 842
## 10 353 ORTHO 1 355
## # … with 390 more rows
Unite can go the other way. This can sometime be useful i.e. if you want a specific sample ID
## # A tibble: 400 x 5
## ENTREZ Sample CellType Rep counts
## <chr> <chr> <chr> <chr> <int>
## 1 350 CD34_1 CD34 1 204
## 2 350 ORTHO_1 ORTHO 1 0
## 3 350 CD34_2 CD34 2 103
## 4 350 ORTHO_2 ORTHO 2 0
## 5 351 CD34_1 CD34 1 15586
## 6 351 ORTHO_1 ORTHO 1 479
## 7 351 CD34_2 CD34 2 10476
## 8 351 ORTHO_2 ORTHO 2 39
## 9 353 CD34_1 CD34 1 842
## 10 353 ORTHO_1 ORTHO 1 355
## # … with 390 more rows
Often when getting started code can get a little out of control and confusing. As addiotnal steps are adeed code can become nested. For example here we tidy our data in a single complicated expression
tidy_counts <- separate(pivot_longer(untidy_counts, names_to = "Sample", values_to = "counts", cols = c(-ENTREZ)), Sample, sep = "_", into = c("CellType","Rep"), remove=FALSE)
We simplified things by saving each step as an intermediate. This is not very efficient.
Piping allows you to pass the result from one expression directly into another.
magrittR package developed the %>% pipe which is integral to the tidy way of formatting code
The pattern is similar but now follows a specific logical flow:
dplyr
This way there are no intermediates and the steps in your data analysis are logical and clear unlike nested code.
tidy_counts <- untidy_counts %>%
gather(key=Sample, value=counts, -ENTREZ) %>%
separate(Sample, sep = "_", into = c("CellType","Rep"), remove=FALSE)
tidy_counts
## # A tibble: 400 x 5
## ENTREZ Sample CellType Rep counts
## <chr> <chr> <chr> <chr> <int>
## 1 350 CD34_1 CD34 1 204
## 2 351 CD34_1 CD34 1 15586
## 3 353 CD34_1 CD34 1 842
## 4 354 CD34_1 CD34 1 0
## 5 355 CD34_1 CD34 1 123
## 6 356 CD34_1 CD34 1 1
## 7 357 CD34_1 CD34 1 380
## 8 358 CD34_1 CD34 1 572
## 9 359 CD34_1 CD34 1 0
## 10 360 CD34_1 CD34 1 320
## # … with 390 more rows
A period ( . ) can be used as placeholders to represent the object being piped in. In this case we use the period to in a logical expression, which we then use to subset itself.
## # A tibble: 330 x 3
## ENTREZ Sample counts
## <chr> <chr> <int>
## 1 350 CD34_1 204
## 2 350 CD34_2 103
## 3 351 CD34_1 15586
## 4 351 ORTHO_1 479
## 5 351 CD34_2 10476
## 6 351 ORTHO_2 39
## 7 353 CD34_1 842
## 8 353 ORTHO_1 355
## 9 353 CD34_2 1188
## 10 353 ORTHO_2 86
## # … with 320 more rows
%<>% is a modification of the pipe. This is a two way pipe that will carry the variable forward, and once expression is resolved, the result then gets assigned back to the original variable.
## # A tibble: 330 x 3
## ENTREZ Sample counts
## <chr> <chr> <int>
## 1 350 CD34_1 204
## 2 350 CD34_2 103
## 3 351 CD34_1 15586
## 4 351 ORTHO_1 479
## 5 351 CD34_2 10476
## 6 351 ORTHO_2 39
## 7 353 CD34_1 842
## 8 353 ORTHO_1 355
## 9 353 CD34_2 1188
## 10 353 ORTHO_2 86
## # … with 320 more rows
Keep your eyes out for a base R pipe as it is under development at the moment |>
Often you have two dataframes and you want to join them together. Data frames can be joined on a shared variable a.k.a. a key. We want this key to be unique i.e. ENTREZ ID.
## # A tibble: 400 x 5
## ENTREZ Sample CellType Rep counts
## <chr> <chr> <chr> <chr> <int>
## 1 350 CD34_1 CD34 1 204
## 2 351 CD34_1 CD34 1 15586
## 3 353 CD34_1 CD34 1 842
## 4 354 CD34_1 CD34 1 0
## 5 355 CD34_1 CD34 1 123
## 6 356 CD34_1 CD34 1 1
## 7 357 CD34_1 CD34 1 380
## 8 358 CD34_1 CD34 1 572
## 9 359 CD34_1 CD34 1 0
## 10 360 CD34_1 CD34 1 320
## # … with 390 more rows
## # A tibble: 23,056 x 3
## ID SYMBOL LENGTH
## <chr> <chr> <int>
## 1 1 A1BG 16043
## 2 10 NAT2 9969
## 3 100 ADA 32214
## 4 1000 CDH2 226516
## 5 10000 AKT3 355352
## 6 100008586 GAGE12F 15729
## 7 100009676 ZBTB11-AS1 2784
## 8 10001 MED6 16428
## 9 10002 NR2E3 7704
## 10 10003 NAALAD2 57962
## # … with 23,046 more rows
Inner Join
Outer Join
My tidy_counts tibble is the on the left, so is the backbone for this join. This will only then show the metadata that has a matching key i.e. ENTREZID.
## # A tibble: 400 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 351 CD34_1 CD34 1 15586 APP 290586
## 3 353 CD34_1 CD34 1 842 APRT 2466
## 4 354 CD34_1 CD34 1 0 KLK3 5850
## 5 355 CD34_1 CD34 1 123 FAS 45990
## 6 356 CD34_1 CD34 1 1 FASLG 7828
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986
## 8 358 CD34_1 CD34 1 572 AQP1 72122
## 9 359 CD34_1 CD34 1 0 AQP2 8141
## 10 360 CD34_1 CD34 1 320 AQP3 6480
## # … with 390 more rows
Now the counts_metadata tibble is the backbone for this join. Many of the keys (i.e. ENTREZID) in this tibble do not have matches in the tidy_counts_tibble. Anything without a match is filled in with NA.
tidy_counts_expressed <- right_join(tidy_counts, counts_metadata, by = c("ENTREZ" = "ID"))
tidy_counts_expressed %>% tail()
## # A tibble: 6 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 9990 <NA> <NA> <NA> NA SLC12A6 108069
## 2 9991 <NA> <NA> <NA> NA PTBP3 115950
## 3 9992 <NA> <NA> <NA> NA KCNE2 7118
## 4 9993 <NA> <NA> <NA> NA DGCR2 86173
## 5 9994 <NA> <NA> <NA> NA CASP8AP2 44537
## 6 9997 <NA> <NA> <NA> NA SCO2 2909
Filtering joins
Semi join only keeps observations in x that are matched in y. y is only used as a reference and is not in output
## # A tibble: 100 x 3
## ID SYMBOL LENGTH
## <chr> <chr> <int>
## 1 350 APOH 17410
## 2 351 APP 290586
## 3 353 APRT 2466
## 4 354 KLK3 5850
## 5 355 FAS 45990
## 6 356 FASLG 7828
## 7 357 SHROOM2 162986
## 8 358 AQP1 72122
## 9 359 AQP2 8141
## 10 360 AQP3 6480
## # … with 90 more rows
Anti join only keeps observations in x that are not matched in y. y is only used as a reference and is not in output
## # A tibble: 22,956 x 3
## ID SYMBOL LENGTH
## <chr> <chr> <int>
## 1 1 A1BG 16043
## 2 10 NAT2 9969
## 3 100 ADA 32214
## 4 1000 CDH2 226516
## 5 10000 AKT3 355352
## 6 100008586 GAGE12F 15729
## 7 100009676 ZBTB11-AS1 2784
## 8 10001 MED6 16428
## 9 10002 NR2E3 7704
## 10 10003 NAALAD2 57962
## # … with 22,946 more rows
This package contains a variety of tools to access and manipulate dataframes.
They have a common rationale:
dplyr
select allows you to work with the columns in your tibble. At its most basic select allows you to make a vector from a specific variable
## # A tibble: 400 x 1
## counts
## <int>
## 1 204
## 2 15586
## 3 842
## 4 0
## 5 123
## 6 1
## 7 380
## 8 572
## 9 0
## 10 320
## # … with 390 more rows
select allows you to make a dataframe from several variables
## # A tibble: 400 x 2
## counts ENTREZ
## <int> <chr>
## 1 204 350
## 2 15586 351
## 3 842 353
## 4 0 354
## 5 123 355
## 6 1 356
## 7 380 357
## 8 572 358
## 9 0 359
## 10 320 360
## # … with 390 more rows
select allows you to make a dataframe excluding a variable
## # A tibble: 400 x 6
## ENTREZ CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34 1 204 APOH 17410
## 2 351 CD34 1 15586 APP 290586
## 3 353 CD34 1 842 APRT 2466
## 4 354 CD34 1 0 KLK3 5850
## 5 355 CD34 1 123 FAS 45990
## 6 356 CD34 1 1 FASLG 7828
## 7 357 CD34 1 380 SHROOM2 162986
## 8 358 CD34 1 572 AQP1 72122
## 9 359 CD34 1 0 AQP2 8141
## 10 360 CD34 1 320 AQP3 6480
## # … with 390 more rows
select allows you to make a dataframe from a range of variables
## # A tibble: 400 x 4
## CellType Rep counts SYMBOL
## <chr> <chr> <int> <chr>
## 1 CD34 1 204 APOH
## 2 CD34 1 15586 APP
## 3 CD34 1 842 APRT
## 4 CD34 1 0 KLK3
## 5 CD34 1 123 FAS
## 6 CD34 1 1 FASLG
## 7 CD34 1 380 SHROOM2
## 8 CD34 1 572 AQP1
## 9 CD34 1 0 AQP2
## 10 CD34 1 320 AQP3
## # … with 390 more rows
filter allows you to work with the rows in your tibble. Through filter you can access observations based on criteria that produces a logical.
## # A tibble: 100 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 351 CD34_1 CD34 1 15586 APP 290586
## 3 353 CD34_1 CD34 1 842 APRT 2466
## 4 354 CD34_1 CD34 1 0 KLK3 5850
## 5 355 CD34_1 CD34 1 123 FAS 45990
## 6 356 CD34_1 CD34 1 1 FASLG 7828
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986
## 8 358 CD34_1 CD34 1 572 AQP1 72122
## 9 359 CD34_1 CD34 1 0 AQP2 8141
## 10 360 CD34_1 CD34 1 320 AQP3 6480
## # … with 90 more rows
Here we use filter to access observations based on several criteria.
## # A tibble: 200 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 351 CD34_1 CD34 1 15586 APP 290586
## 3 353 CD34_1 CD34 1 842 APRT 2466
## 4 354 CD34_1 CD34 1 0 KLK3 5850
## 5 355 CD34_1 CD34 1 123 FAS 45990
## 6 356 CD34_1 CD34 1 1 FASLG 7828
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986
## 8 358 CD34_1 CD34 1 572 AQP1 72122
## 9 359 CD34_1 CD34 1 0 AQP2 8141
## 10 360 CD34_1 CD34 1 320 AQP3 6480
## # … with 190 more rows
Similarly we can use Filter to access observations based on numerical operators
## # A tibble: 330 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 351 CD34_1 CD34 1 15586 APP 290586
## 3 353 CD34_1 CD34 1 842 APRT 2466
## 4 355 CD34_1 CD34 1 123 FAS 45990
## 5 356 CD34_1 CD34 1 1 FASLG 7828
## 6 357 CD34_1 CD34 1 380 SHROOM2 162986
## 7 358 CD34_1 CD34 1 572 AQP1 72122
## 8 360 CD34_1 CD34 1 320 AQP3 6480
## 9 362 CD34_1 CD34 1 3 AQP5 4187
## 10 363 CD34_1 CD34 1 14 AQP6 9946
## # … with 320 more rows
Arrange sorts the dataframe based on a specific variable, in ascending order.
## # A tibble: 400 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 354 CD34_1 CD34 1 0 KLK3 5850
## 2 359 CD34_1 CD34 1 0 AQP2 8141
## 3 361 CD34_1 CD34 1 0 AQP4 13709
## 4 398 CD34_1 CD34 1 0 ARHGDIG 2763
## 5 401 CD34_1 CD34 1 0 PHOX2A 5100
## 6 417 CD34_1 CD34 1 0 ART1 19286
## 7 429 CD34_1 CD34 1 0 ASCL1 2843
## 8 474 CD34_1 CD34 1 0 ATOH1 1065
## 9 495 CD34_1 CD34 1 0 ATP4A 13466
## 10 350 ORTHO_1 ORTHO 1 0 APOH 17410
## # … with 390 more rows
Arrange can sort based on multiple variables using the order provided. We can also use the desc() function to change the order to descending.
## # A tibble: 400 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 351 CD34_1 CD34 1 15586 APP 290586
## 2 397 CD34_1 CD34 1 12997 ARHGDIB 19613
## 3 388 CD34_1 CD34 1 12598 RHOB 2367
## 4 397 CD34_2 CD34 2 11265 ARHGDIB 19613
## 5 351 CD34_2 CD34 2 10476 APP 290586
## 6 387 CD34_1 CD34 1 8899 RHOA 52948
## 7 468 CD34_1 CD34 1 7991 ATF4 2123
## 8 387 CD34_2 CD34 2 7405 RHOA 52948
## 9 396 CD34_1 CD34 1 6503 ARHGDIA 3686
## 10 468 CD34_2 CD34 2 5307 ATF4 2123
## # … with 390 more rows
Mutate creates a new column based on some form of computation. Here we are adding a z-score for the counts.
## # A tibble: 400 x 8
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH `scale(counts)`[,1]
## <chr> <chr> <chr> <chr> <int> <chr> <int> <dbl>
## 1 350 CD34_1 CD34 1 204 APOH 17410 -0.354
## 2 351 CD34_1 CD34 1 15586 APP 290586 7.19
## 3 353 CD34_1 CD34 1 842 APRT 2466 -0.0414
## 4 354 CD34_1 CD34 1 0 KLK3 5850 -0.454
## 5 355 CD34_1 CD34 1 123 FAS 45990 -0.394
## 6 356 CD34_1 CD34 1 1 FASLG 7828 -0.454
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986 -0.268
## 8 358 CD34_1 CD34 1 572 AQP1 72122 -0.174
## 9 359 CD34_1 CD34 1 0 AQP2 8141 -0.454
## 10 360 CD34_1 CD34 1 320 AQP3 6480 -0.298
## # … with 390 more rows
Mutate can be used to create a named variable in much the same way.
## # A tibble: 400 x 8
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH count_zscore[,1]
## <chr> <chr> <chr> <chr> <int> <chr> <int> <dbl>
## 1 350 CD34_1 CD34 1 204 APOH 17410 -0.354
## 2 351 CD34_1 CD34 1 15586 APP 290586 7.19
## 3 353 CD34_1 CD34 1 842 APRT 2466 -0.0414
## 4 354 CD34_1 CD34 1 0 KLK3 5850 -0.454
## 5 355 CD34_1 CD34 1 123 FAS 45990 -0.394
## 6 356 CD34_1 CD34 1 1 FASLG 7828 -0.454
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986 -0.268
## 8 358 CD34_1 CD34 1 572 AQP1 72122 -0.174
## 9 359 CD34_1 CD34 1 0 AQP2 8141 -0.454
## 10 360 CD34_1 CD34 1 320 AQP3 6480 -0.298
## # … with 390 more rows
Summarize applies aggregating or summary function to a group i.e. counting. Lets see how many genes we have no counts for in each sample. First we will generate a table of genes with no counts.
## # A tibble: 70 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 354 CD34_1 CD34 1 0 KLK3 5850
## 2 359 CD34_1 CD34 1 0 AQP2 8141
## 3 361 CD34_1 CD34 1 0 AQP4 13709
## 4 398 CD34_1 CD34 1 0 ARHGDIG 2763
## 5 401 CD34_1 CD34 1 0 PHOX2A 5100
## 6 417 CD34_1 CD34 1 0 ART1 19286
## 7 429 CD34_1 CD34 1 0 ASCL1 2843
## 8 474 CD34_1 CD34 1 0 ATOH1 1065
## 9 495 CD34_1 CD34 1 0 ATP4A 13466
## 10 350 ORTHO_1 ORTHO 1 0 APOH 17410
## # … with 60 more rows
Now we have some data to play with, we do the first key step in summarization: group_by. This adds the Group property. When we summarize it will be within these groups i.e. in this case its within Sample.
## # A tibble: 70 x 7
## # Groups: Sample [4]
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 354 CD34_1 CD34 1 0 KLK3 5850
## 2 359 CD34_1 CD34 1 0 AQP2 8141
## 3 361 CD34_1 CD34 1 0 AQP4 13709
## 4 398 CD34_1 CD34 1 0 ARHGDIG 2763
## 5 401 CD34_1 CD34 1 0 PHOX2A 5100
## 6 417 CD34_1 CD34 1 0 ART1 19286
## 7 429 CD34_1 CD34 1 0 ASCL1 2843
## 8 474 CD34_1 CD34 1 0 ATOH1 1065
## 9 495 CD34_1 CD34 1 0 ATP4A 13466
## 10 350 ORTHO_1 ORTHO 1 0 APOH 17410
## # … with 60 more rows
Once the tibble has group information we can then summarize over the these groups. We can use the n() function to count how many observations there are for each member of our group, Samples.
## # A tibble: 4 x 2
## Sample `n()`
## * <chr> <int>
## 1 CD34_1 9
## 2 CD34_2 8
## 3 ORTHO_1 22
## 4 ORTHO_2 31
We can use Summarize to apply more complex aggregating or summary functions i.e. mean. In this case we will calculate the mean gene counts across conditions.
## # A tibble: 100 x 2
## ENTREZ counts_mean
## * <chr> <dbl>
## 1 350 76.8
## 2 351 6645
## 3 353 618.
## 4 354 0
## 5 355 142.
## 6 356 0.5
## 7 357 140
## 8 358 1861.
## 9 359 3.25
## 10 360 496.
## # … with 90 more rows
We can use also use multiple groups, for group_by. We can use this to generate a more useful mean; the mean counts in genes per CellType.
## # A tibble: 200 x 3
## # Groups: ENTREZ [100]
## ENTREZ CellType counts_mean
## <chr> <chr> <dbl>
## 1 350 CD34 154.
## 2 350 ORTHO 0
## 3 351 CD34 13031
## 4 351 ORTHO 259
## 5 353 CD34 1015
## 6 353 ORTHO 220.
## 7 354 CD34 0
## 8 354 ORTHO 0
## 9 355 CD34 131
## 10 355 ORTHO 154.
## # … with 190 more rows
Grouping can also be used to help filter within groups. Here we use order to get the genes with the top 3 amount of counts.
## # A tibble: 12 x 7
## # Groups: Sample [4]
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 395 CD34_1 CD34 1 348 ARHGAP6 528159
## 3 414 CD34_1 CD34 1 516 ARSD 25406
## 4 388 ORTHO_1 ORTHO 1 171 RHOB 2367
## 5 392 ORTHO_1 ORTHO 1 304 ARHGAP1 23591
## 6 471 ORTHO_1 ORTHO 1 771 ATIC 37818
## 7 351 CD34_2 CD34 2 10476 APP 290586
## 8 379 CD34_2 CD34 2 181 ARL4D 2152
## 9 417 CD34_2 CD34 2 0 ART1 19286
## 10 390 ORTHO_2 ORTHO 2 0 RND3 19503
## 11 398 ORTHO_2 ORTHO 2 0 ARHGDIG 2763
## 12 443 ORTHO_2 ORTHO 2 0 ASPA 25297
We can stitch this together to do more complicated operations. In this case I am filtering to observations without 0 counts. I then am grouping by CellType and ENTREZ.
## # A tibble: 330 x 7
## # Groups: CellType, ENTREZ [174]
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 351 CD34_1 CD34 1 15586 APP 290586
## 3 353 CD34_1 CD34 1 842 APRT 2466
## 4 355 CD34_1 CD34 1 123 FAS 45990
## 5 356 CD34_1 CD34 1 1 FASLG 7828
## 6 357 CD34_1 CD34 1 380 SHROOM2 162986
## 7 358 CD34_1 CD34 1 572 AQP1 72122
## 8 360 CD34_1 CD34 1 320 AQP3 6480
## 9 362 CD34_1 CD34 1 3 AQP5 4187
## 10 363 CD34_1 CD34 1 14 AQP6 9946
## # … with 320 more rows
Now that they are grouped by CellType and ENTREZ, we can filter for all genes that have counts greater than 0, in both replicates.
## # A tibble: 312 x 7
## # Groups: CellType, ENTREZ [156]
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 350 CD34_1 CD34 1 204 APOH 17410
## 2 351 CD34_1 CD34 1 15586 APP 290586
## 3 353 CD34_1 CD34 1 842 APRT 2466
## 4 355 CD34_1 CD34 1 123 FAS 45990
## 5 357 CD34_1 CD34 1 380 SHROOM2 162986
## 6 358 CD34_1 CD34 1 572 AQP1 72122
## 7 360 CD34_1 CD34 1 320 AQP3 6480
## 8 362 CD34_1 CD34 1 3 AQP5 4187
## 9 363 CD34_1 CD34 1 14 AQP6 9946
## 10 364 CD34_1 CD34 1 7 AQP7 17570
## # … with 302 more rows
By now it is clear how useful pipes can be, to stitch together the different operations. These pipes can go straight into plots. Here we make a simple X-Y plot to compare counts in our cell types.
By now it is clear how useful pipes can be, to stitch together the different operations. These pipes can go straight into plots. Here we make a simple X-Y plot to compare counts in our cell types.
You’ve made a lovely new tibble file. Now you need to save it somewhere. Theres a wide range of writing options. Can specify the delmiter directly or use a specific function
write_delim(tidy_counts_meta, '../counts_with_metadata.csv', delim =',')
write_csv(tidy_counts_meta, '../counts_with_metadata.csv')
A key difference compared to base is that it does not write out row names. Tibbles generally don’t have rownames.
If the data you are working with involves characters from data entry often there will be errors i.e. clinical study metadata or a hand-typed list of genes of interest. Tidying data also means fixing these problems. stringr helps make this easy.
Though stringr is pretty comprehensive and covers most of what you will need, there is a sister package called stringi with even more functionality.
Many overlapping functions with base for combining, subsetting, converting and finding strings.
Extract substrings from a range. Here the 1st to 3rd character.
## [1] "Tom" "Ji-" "Mat" "Wei" "Dou"
Extract substrings from a range. Here the 2nd to 2nd to last character
## [1] "o" "i-Dun" "atthe" "e" "ou"
Many overlapping functions with base for combining, subsetting, converting and finding strings.
Assign values back to substrings. Here the 2nd to 2nd to last character is replaced with X.
## [1] "TXm" "JXg" "MXw" "WXi" "DXg"
It is easy to replace specific strings with str_replace_all.
## [1] "TXm" "JXg" "MXw" "WXi" "DXg"
## [1] "TXm" "JXg" "MXw" "WXi" "DXg"
Whitespace is often an issue for conssitent metadata. WE can use the str_replace_all function to remove whitespace.
## [1] "Tom" "Ji-Dung" "Matt" "Wei" "Doug"
Rather then replacing all whitespace, you can trim trailing and leading whitespace from strings.
## [1] "Tom" "Ji -Dung" "Matt" "Wei" "D o u g"
Can add whitespace to strings to get consistent length. Here all are 10 characters
## [1] " Tom " " Ji -Dung" " Matt " " Wei" " D o u g"
Often when manipulating strings we will also need to alter the capitalization. A common case is if we want to do a quick and easy conversion between gene symbol nomenclature.
## [1] "APOH" "APP" "APRT" "KLK3" "FAS" "FASLG"
## [1] "Apoh" "App" "Aprt" "Klk3" "Fas" "Faslg"
String manipulation can be used with tibbles and mutate.
## # A tibble: 400 x 8
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH SYMBOL2
## <chr> <chr> <chr> <chr> <int> <chr> <int> <chr>
## 1 350 CD34_1 CD34 1 204 APOH 17410 Apoh
## 2 351 CD34_1 CD34 1 15586 APP 290586 App
## 3 353 CD34_1 CD34 1 842 APRT 2466 Aprt
## 4 354 CD34_1 CD34 1 0 KLK3 5850 Klk3
## 5 355 CD34_1 CD34 1 123 FAS 45990 Fas
## 6 356 CD34_1 CD34 1 1 FASLG 7828 Faslg
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986 Shroom2
## 8 358 CD34_1 CD34 1 572 AQP1 72122 Aqp1
## 9 359 CD34_1 CD34 1 0 AQP2 8141 Aqp2
## 10 360 CD34_1 CD34 1 320 AQP3 6480 Aqp3
## # … with 390 more rows
We can convert back to capitals, just as easy.
tidy_counts_meta %>%
mutate(SYMBOL2 = str_to_title(SYMBOL)) %>%
mutate(SYMBOL3 = str_to_upper(SYMBOL2))
## # A tibble: 400 x 9
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH SYMBOL2 SYMBOL3
## <chr> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr>
## 1 350 CD34_1 CD34 1 204 APOH 17410 Apoh APOH
## 2 351 CD34_1 CD34 1 15586 APP 290586 App APP
## 3 353 CD34_1 CD34 1 842 APRT 2466 Aprt APRT
## 4 354 CD34_1 CD34 1 0 KLK3 5850 Klk3 KLK3
## 5 355 CD34_1 CD34 1 123 FAS 45990 Fas FAS
## 6 356 CD34_1 CD34 1 1 FASLG 7828 Faslg FASLG
## 7 357 CD34_1 CD34 1 380 SHROOM2 162986 Shroom2 SHROOM2
## 8 358 CD34_1 CD34 1 572 AQP1 72122 Aqp1 AQP1
## 9 359 CD34_1 CD34 1 0 AQP2 8141 Aqp2 AQP2
## 10 360 CD34_1 CD34 1 320 AQP3 6480 Aqp3 AQP3
## # … with 390 more rows
Find patterns in different ways. str_detect gives a T/F whether the pattern ‘GAP’ is present in vector.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## # A tibble: 16 x 7
## ENTREZ Sample CellType Rep counts SYMBOL LENGTH
## <chr> <chr> <chr> <chr> <int> <chr> <int>
## 1 392 CD34_1 CD34 1 1527 ARHGAP1 23591
## 2 393 CD34_1 CD34 1 2949 ARHGAP4 18885
## 3 394 CD34_1 CD34 1 1525 ARHGAP5 82746
## 4 395 CD34_1 CD34 1 348 ARHGAP6 528159
## 5 392 ORTHO_1 ORTHO 1 304 ARHGAP1 23591
## 6 393 ORTHO_1 ORTHO 1 138 ARHGAP4 18885
## 7 394 ORTHO_1 ORTHO 1 464 ARHGAP5 82746
## 8 395 ORTHO_1 ORTHO 1 67 ARHGAP6 528159
## 9 392 CD34_2 CD34 2 786 ARHGAP1 23591
## 10 393 CD34_2 CD34 2 1540 ARHGAP4 18885
## 11 394 CD34_2 CD34 2 1062 ARHGAP5 82746
## 12 395 CD34_2 CD34 2 123 ARHGAP6 528159
## 13 392 ORTHO_2 ORTHO 2 71 ARHGAP1 23591
## 14 393 ORTHO_2 ORTHO 2 3 ARHGAP4 18885
## 15 394 ORTHO_2 ORTHO 2 134 ARHGAP5 82746
## 16 395 ORTHO_2 ORTHO 2 0 ARHGAP6 528159
str_subset is similar to detect, but it returns the match itself.
## [1] "ARHGAP1" "ARHGAP4" "ARHGAP5" "ARHGAP6" "ARHGAP1" "ARHGAP4" "ARHGAP5"
## [8] "ARHGAP6" "ARHGAP1" "ARHGAP4" "ARHGAP5" "ARHGAP6" "ARHGAP1" "ARHGAP4"
## [15] "ARHGAP5" "ARHGAP6"
Count gives you the total number of times your pattern appears in each character in the vector.
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
## [38] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1] 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 2 1 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 2 2 2
## [38] 2 2 1 1 0 1 1 1 1 1 1 1 1 1 2 1 0 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 0
## [75] 0 1 1 1 1 1 0 1 1 2 2 2 2 2 1 1 1 0 2 2 2 1 1 1 1 2
Hadley Wickham (Chief Scientist at RStudio) is the driving force behind the tidyverse. He is a good place to start at to get an idea of the future of tidy.
Hadley wrote a paper about why he thinks tidy data is best: www.jstatsoft.org/v59/i10/paper.
There is a lot of support for all things tidy at: https://www.tidyverse.org/
(This includes really great cheat sheets for each tool)
lubridate and hms: Allow managing of calendar and time formats
broom: helps tidy up standard base function i.e. lm or t.test
tidymodels: A collection of tools for preparing for and validating model functions
plyranges: dplyr equivalent for working with ranges i.e. Granges
ggplot
https://rockefelleruniversity.github.io/Plotting_In_R/
R for Data Science text book
https://r4ds.had.co.nz/
Data Science with R text book
http://garrettgman.github.io/