class: center, middle, inverse, title-slide .title[ # Introduction to R - Abridged, Session 2
] .author[ ### Rockefeller University, Bioinformatics Resource Centre ] .date[ ###
http://rockefelleruniversity.github.io/RU_introtoR_abridged/
] --- ## Overview - [Reading and writing in R](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session2.html#reading-and-writing-data-in-r) - [Ordering, selecting and merging](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session2.html#Ordering,_selecting_and_merging) - [Conditions and Loops](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session2.html#Conditions_and_Loops) - [Plotting](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session2.html#Plotting) --- class: inverse, center, middle # Reading and Writing Data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Data from external sources Most of the time, you will not be generating data in R but will be importing data from external files. A standard format for this data is a table: <table> <thead> <tr> <th style="text-align:left;"> Gene_Name </th> <th style="text-align:right;"> Sample_1.hi </th> <th style="text-align:right;"> Sample_2.hi </th> <th style="text-align:right;"> Sample_3.hi </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Gene_a </td> <td style="text-align:right;"> 2.651271 </td> <td style="text-align:right;"> 4.875517 </td> <td style="text-align:right;"> 3.365871 </td> </tr> <tr> <td style="text-align:left;"> Gene_b </td> <td style="text-align:right;"> 3.824938 </td> <td style="text-align:right;"> 3.536458 </td> <td style="text-align:right;"> 3.209887 </td> </tr> <tr> <td style="text-align:left;"> Gene_c </td> <td style="text-align:right;"> 5.711694 </td> <td style="text-align:right;"> 1.519591 </td> <td style="text-align:right;"> 3.125420 </td> </tr> <tr> <td style="text-align:left;"> Gene_d </td> <td style="text-align:right;"> 4.107880 </td> <td style="text-align:right;"> 4.694903 </td> <td style="text-align:right;"> 3.481595 </td> </tr> <tr> <td 
style="text-align:left;"> Gene_e </td> <td style="text-align:right;"> 10.778529 </td> <td style="text-align:right;"> 10.268044 </td> <td style="text-align:right;"> 8.836057 </td> </tr> <tr> <td style="text-align:left;"> Gene_f </td> <td style="text-align:right;"> 11.095507 </td> <td style="text-align:right;"> 9.715402 </td> <td style="text-align:right;"> 9.227733 </td> </tr> <tr> <td style="text-align:left;"> Gene_g </td> <td style="text-align:right;"> 10.092907 </td> <td style="text-align:right;"> 9.579511 </td> <td style="text-align:right;"> 8.641481 </td> </tr> <tr> <td style="text-align:left;"> Gene_h </td> <td style="text-align:right;"> 8.504310 </td> <td style="text-align:right;"> 12.623268 </td> <td style="text-align:right;"> 11.317170 </td> </tr> </tbody> </table> --- ## First we need a file to read in Hopefully you've downloaded the [course material](https://github.com/rockefelleruniversity/Intro_To_R_1Day/zipball/master); it contains a table we can read in. Once the course material is unzipped we need to change our *working directory* to the downloaded folder. The working directory is the location from which R looks for files on your computer. You can use *getwd()* to check your current working directory. *dir()* lists the files and folders in that directory. And *setwd()* allows you to change the working directory. ```r getwd() dir() setwd("~/Downloads/RockefellerUniversity-Intro_To_R/r_course") ``` --- ## Data from text file with read.table() Tables from text files can be read with the **read.table()** function. ```r Table <- read.table("data/readThisTable.csv",sep=",",header=T) Table[1:4,1:3] ``` ``` ## Gene_Name Sample_1.hi Sample_2.hi ## 1 Gene_a 4.570237 3.230467 ## 2 Gene_b 3.561733 3.632285 ## 3 Gene_c 3.797274 2.874462 ## 4 Gene_d 3.398242 4.415202 ``` Here we have provided two arguments. - The **sep** argument specifies how columns are separated in our text file ("," for .csv, "\t" for .tsv). - The **header** argument specifies whether the first line of the file contains column names. 
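A minimal sketch of how these two arguments behave, using a small made-up tab-separated file written to a temporary location. The file contents and the names `tsv_file`, `small_table` and `no_header` are illustrative assumptions, not part of the course material.

```r
# Write a tiny tab-separated table to a temporary file.
# The gene names and values below are made-up illustration data.
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("Gene_Name\tSample_1\tSample_2",
             "Gene_a\t2.5\t4.0",
             "Gene_b\t3.5\t1.5"),
           tsv_file)

# sep = "\t" splits columns on tabs; header = TRUE uses the first line as column names.
small_table <- read.table(tsv_file, sep = "\t", header = TRUE)
small_table

# With header = FALSE the header line is read as data,
# and the columns receive the default names V1, V2, V3.
no_header <- read.table(tsv_file, sep = "\t", header = FALSE)
dim(no_header)
```

Because the header line is treated as data when `header = FALSE`, every column in `no_header` is read as character rather than numeric, which is a common sign that the argument was set incorrectly.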
--- ## Row names in read.table() read.table() allows for significant control over reading files through its many arguments. Have a look at options by using **?read.table** The **row.names** argument can be used to specify a column to use as row names for the resulting data frame. Here we use the first column as row names. ```r Table <- read.table("data/readThisTable.csv",sep=",",header=T,row.names=1) Table[1:4,1:3] ``` ``` ## Sample_1.hi Sample_2.hi Sample_3.hi ## Gene_a 4.570237 3.230467 3.351827 ## Gene_b 3.561733 3.632285 3.587523 ## Gene_c 3.797274 2.874462 4.016916 ## Gene_d 3.398242 4.415202 4.893561 ``` --- ## Data from other sources The read.table function can also read data from http. ```r URL <- "http://rockefelleruniversity.github.io/readThisTable.csv" Table <- read.table(URL,sep=",",header=T) Table[1:2,1:3] ``` ``` ## Gene_Name Sample_1.hi Sample_2.hi ## 1 Gene_a 4.111851 3.837018 ## 2 Gene_b 6.047822 5.683518 ``` --- ## Writing data to file Once we have our data analysed in R, we will want to export it to a file. The most common method is to use the write.table() function ```r write.table(Table, file="data/writeThisTable.csv", sep=",") ``` Since our data has column names but no row names, I will provide the arguments col.names and row.names to write.table() ```r write.table(Table, file="data/writeThisTable.csv", sep=",", row.names =F, col.names=T) ``` --- # Reviewing your data It is always important to know what your data is. Especially when you are reading it in for the first time. We have used indexing to get a taste of the data frames so far. But there are two functions to quickly check your data. **head()** and **tail()** return the first or last 6 lines by default. 
```r head(Table) ``` ``` ## Gene_Name Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low ## 1 Gene_a 4.111851 3.837018 4.360628 3.752517 4.368069 ## 2 Gene_b 6.047822 5.683518 4.315889 3.381136 3.630273 ## 3 Gene_c 2.597068 3.316300 3.681509 4.886520 4.318289 ## 4 Gene_d 6.009197 5.927419 2.244701 6.574108 8.288831 ## 5 Gene_e 10.152509 10.218200 10.004835 2.251603 1.805168 ## 6 Gene_f 11.107868 9.592153 10.263975 3.567560 2.496475 ## Sample_1.low ## 1 3.421009 ## 2 5.560802 ## 3 5.097783 ## 4 6.857291 ## 5 2.396295 ## 6 3.587755 ``` --- # Reviewing your data ```r tail(Table) ``` ``` ## Gene_Name Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low ## 3 Gene_c 2.597068 3.316300 3.681509 4.886520 4.318289 ## 4 Gene_d 6.009197 5.927419 2.244701 6.574108 8.288831 ## 5 Gene_e 10.152509 10.218200 10.004835 2.251603 1.805168 ## 6 Gene_f 11.107868 9.592153 10.263975 3.567560 2.496475 ## 7 Gene_g 8.705787 8.949422 9.226990 10.051516 7.841664 ## 8 Gene_h 9.239039 9.839734 10.027812 11.084444 9.316200 ## Sample_1.low ## 3 5.097783 ## 4 6.857291 ## 5 2.396295 ## 6 3.587755 ## 7 9.649869 ## 8 8.742943 ``` ```r head(Table, 3) ``` ``` ## Gene_Name Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low ## 1 Gene_a 4.111851 3.837018 4.360628 3.752517 4.368069 ## 2 Gene_b 6.047822 5.683518 4.315889 3.381136 3.630273 ## 3 Gene_c 2.597068 3.316300 3.681509 4.886520 4.318289 ## Sample_1.low ## 1 3.421009 ## 2 5.560802 ## 3 5.097783 ``` --- ## The rio (R io) package We may want to import from formats other than plain text. We can use the rio package, which allows us to import and export data in multiple formats. Formats include: * XML. * Matlab, SAS, SPSS and minitab output formats. * Excel and OpenOffice formats. --- ## The rio package To make use of the rio package functionality we first need to install the package into our version of R. We can do this by using the **install.packages()** function with the name of the package we wish to install. 
**install.packages(_PACKAGENAME_)** ```r install.packages("rio") ``` --- ## The rio package Once we have installed a package, we will need to load it to make its functions available to us. We can load a package by using the **library()** function with the name of the package we wish to load. **library(_PACKAGENAME_)** ```r library("rio") ``` --- ## The rio package The two main functions in the rio package are the **import** and **export** functions. We can use the **import()** function to read in our csv file. We simply specify our file as an argument to the **import()** function. **import(_Filename_)** ```r Table <- import("data/readThisTable.csv") Table[1:2,] ``` ``` ## Gene_Name Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low ## 1 Gene_a 4.570237 3.230467 3.351827 3.930877 4.098247 ## 2 Gene_b 3.561733 3.632285 3.587523 4.185287 1.380976 ## Sample_1.low ## 1 4.418726 ## 2 5.936990 ``` --- ## The rio package When importing from an Excel file, only the first sheet is retrieved by default. We can specify the sheet by name or number using the **which** argument. ```r Table <- import("data/readThisXLS.xls", which=2) Table <- import("data/readThisXLS.xls", which="Metadata") Table[1:2,] ``` ``` ## Patient Condition Treatment ## 1 Sample_1.hi A X ## 2 Sample_2.hi A NoTreatment ``` --- ## The rio package We can export our data back to file using the **export()** function, giving the name of the output file to the **file** argument. The **export()** function will guess the required format from the file extension. ```r ExpressionScores <- Table$ExpressionScores export(ExpressionScores, file = "data/writeThisXLSX.xlsx") ``` --- class: inverse, center, middle # Ordering, selecting and merging <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Working with your data Data analysis typically starts when you begin performing operations on your input data. 
Common types of operation are: * Ordering * Selecting * Merging --- ## Let's get some input data ```r my_df <- read.table("data/categoriesAndExpression.txt",sep="\t",header=T) head(my_df) ``` ``` ## geneName ofInterest pathway Expression ## 1 Gene1 Selected Glycolysis 20.09519 ## 2 Gene2 Selected Glycolysis 23.00306 ## 3 Gene3 Selected Glycolysis 20.99712 ## 4 Gene4 Selected Glycolysis 43.01145 ## 5 Gene5 Selected Glycolysis 22.00567 ## 6 Gene6 Selected Glycolysis 20.99162 ``` --- ## Ordering The order function can be used to reorder objects in R. The result of this function is the set of index positions that would sort the input from smallest to largest. ```r order(my_df[,4]) ``` ``` ## [1] 28 50 49 60 43 29 15 1 37 31 52 34 6 3 9 22 33 5 ## [19] 54 27 8 14 59 48 36 21 30 2 42 51 45 56 39 35 24 11 ## [37] 7 25 12 40 46 57 10 44 55 38 23 19 32 4 53 79 64 96 ## [55] 90 73 98 65 91 80 74 97 83 85 68 77 94 62 70 87 69 17 ## [73] 86 71 88 16 63 89 78 72 95 99 92 66 81 75 82 84 67 18 ## [91] 93 100 61 76 20 58 41 47 26 13 ``` --- ## Ordering We can use the result of order to index our data frame. This will reorder the rows of the data frame to follow that order. In this case we are reordering from lowest to highest expression. ```r my_df_ordered <- my_df[order(my_df[,4]),] head(my_df_ordered) ``` ``` ## geneName ofInterest pathway Expression ## 28 Gene28 NotSelected Glycolysis 19.94369 ## 50 Gene50 NotSelected Glycolysis 19.95572 ## 49 Gene49 NotSelected Glycolysis 19.95703 ## 60 Gene60 NotSelected Glycolysis 19.97635 ## 43 Gene43 NotSelected Glycolysis 19.98250 ## 29 Gene29 NotSelected Glycolysis 20.02165 ``` --- ## Ordering Often we want to order based on the highest value i.e. we want the highest expression genes at the top of our data frame. We can use the decreasing argument to control this. 
```r my_df_ordered <- my_df[order(my_df[,4], decreasing = T),] head(my_df_ordered) ``` ``` ## geneName ofInterest pathway Expression ## 13 Gene13 Selected Glycolysis 74.08310 ## 26 Gene26 NotSelected Glycolysis 73.98877 ## 47 Gene47 NotSelected Glycolysis 73.96610 ## 41 Gene41 NotSelected Glycolysis 73.96022 ## 58 Gene58 NotSelected Glycolysis 73.94659 ## 20 Gene20 Selected TGFb 66.09706 ``` --- ## Subsetting Another operation we often want to do is subset our dataset based on a specific condition e.g. we only want to look at Glycolysis genes, or only at genes above a certain expression threshold. To do this we use a logical operator to test whether a condition is TRUE. Here we see which genes have an expression greater than 70. ```r my_df_ordered$Expression > 70 ``` ``` ## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [97] FALSE FALSE FALSE FALSE ``` --- ## Logical operators Operators that we commonly use are: - **==** tests whether values are equal. - **>** and **<** test for greater than or less than respectively. - **>=** and **<=** test for greater than or equal to, or less than or equal to, respectively. --- ## Logical and indexing The result of these expressions is a logical vector of TRUE/FALSE values. These vectors can be used to index, just like numerical vectors. Only the elements where the vector is TRUE are returned. 
```r idx <- my_df_ordered$Expression > 70 my_df_ordered[idx,] ``` ``` ## geneName ofInterest pathway Expression ## 13 Gene13 Selected Glycolysis 74.08310 ## 26 Gene26 NotSelected Glycolysis 73.98877 ## 47 Gene47 NotSelected Glycolysis 73.96610 ## 41 Gene41 NotSelected Glycolysis 73.96022 ## 58 Gene58 NotSelected Glycolysis 73.94659 ``` --- ## Combining logical vectors Logical vectors can be used in combination in order to index vectors. To combine logical vectors we can use some common R operators. - **&** - Requires both logical values to be TRUE. - **|** - Requires either logical value to be TRUE. - **!** - Reverses the logical value, so TRUE becomes FALSE and FALSE becomes TRUE. ```r my_df_ordered[my_df_ordered$Expression > 60 & my_df_ordered$pathway == "TGFb",] ``` ``` ## geneName ofInterest pathway Expression ## 20 Gene20 Selected TGFb 66.09706 ## 76 Gene76 NotSelected TGFb 63.08147 ## 61 Gene61 NotSelected TGFb 63.04337 ## 100 Gene100 NotSelected TGFb 62.93485 ## 93 Gene93 NotSelected TGFb 62.93153 ``` --- ## The %in% operator A common task in R is to subset one vector by the values in another vector. The **%in%** operator in the context **A %in% B** creates a logical vector indicating whether each value in **A** matches any value in **B**. 
```r my_favorite_genes <- c("Gene1","Gene10","Gene15") logical_index <- my_df$geneName %in% my_favorite_genes logical_index ``` ``` ## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE ## [13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [97] FALSE FALSE FALSE FALSE ``` --- ## The %in% operator This can then be used to subset one character vector by the values in a second. Here we use it to select the matching rows of our data frame. ```r my_df[logical_index,] ``` ``` ## geneName ofInterest pathway Expression ## 1 Gene1 Selected Glycolysis 20.09519 ## 10 Gene10 Selected Glycolysis 34.91377 ## 15 Gene15 Selected Glycolysis 20.06247 ``` --- ## Merging A common operation is to join two data frames by a column of common values. .pull-left[ ```r my_df2 <- read.table("data/gene_lengths.txt",sep="\t",header=T) nrow(my_df2) ``` ``` ## [1] 15 ``` ```r head(my_df2) ``` ``` ## Gene Length ## 1 Gene1 1788 ## 2 Gene3 213 ## 3 Gene5 529 ## 4 Gene7 234 ## 5 Gene8 1638 ## 6 Gene9 917 ``` ] .pull-right[ ```r nrow(my_df) ``` ``` ## [1] 100 ``` ] --- ## Merging data frames To do this we can use the **merge()** function with the data frames as the first two arguments. We then specify the columns to merge on: the **by** argument when the column name is shared, or the **by.x** and **by.y** arguments when the names differ between the two data frames. To keep only rows with values common to both data frames the **all** argument is set to FALSE. 
```r merge_df <- merge(my_df, my_df2, by.x="geneName", by.y="Gene", all=FALSE) merge_df ``` ``` ## geneName ofInterest pathway Expression Length ## 1 Gene1 Selected Glycolysis 20.09519 1788 ## 2 Gene10 Selected Glycolysis 34.91377 1882 ## 3 Gene12 Selected Glycolysis 27.01314 501 ## 4 Gene13 Selected Glycolysis 74.08310 1045 ## 5 Gene15 Selected Glycolysis 20.06247 1869 ## 6 Gene16 Selected TGFb 56.03506 851 ## 7 Gene17 Selected TGFb 54.00140 1807 ## 8 Gene18 Selected TGFb 59.04783 600 ## 9 Gene19 Selected TGFb 42.91023 1889 ## 10 Gene20 Selected TGFb 66.09706 992 ## 11 Gene3 Selected Glycolysis 20.99712 213 ## 12 Gene5 Selected Glycolysis 22.00567 529 ## 13 Gene7 Selected Glycolysis 26.07826 234 ## 14 Gene8 Selected Glycolysis 22.92961 1638 ## 15 Gene9 Selected Glycolysis 21.02250 917 ``` --- class: inverse, center, middle # Conditions and Loops <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Conditions and Loops We have looked at using logical vectors as a way to index other data types. ```r x <- 1:10 x[x < 4] ``` ``` ## [1] 1 2 3 ``` Logicals are also used in controlling how scripted procedures execute. --- ## Conditional branching Conditional branching is the evaluation of a logical to determine whether a chunk of code is executed. In R, we use the **if** statement with the logical to be evaluated in **()** and dependent code to be executed in **{}**. ```r x <- 10 y <- 4 if(x > y){ message("The value of x is ",x," which is greater than ", y) } ``` ``` ## The value of x is 10 which is greater than 4 ``` The message is printed above because x is greater than y. ```r y <- 20 if(x > y){ message("The value of x is ",x," which is greater than ", y) } ``` x is now no longer greater than y, so no message is printed. It would be better if every outcome produced a message telling us the result of the condition. 
--- ## else following an if .pull-left[ If we want to perform an operation when the condition is false we can follow the if() statement with an else statement. ```r x <- 3 if(x < 5){ message(x, " is less than 5") }else{ message(x," is greater than or equal to 5") } ``` ``` ## 3 is less than 5 ``` ] .pull-right[ With the addition of the else statement, when x is not less than 5 the code following the else statement is executed. ```r x <- 10 if(x < 5){ message(x, " is less than 5") }else{ message(x," is greater than or equal to 5") } ``` ``` ## 10 is greater than or equal to 5 ``` ] --- ## else if We may wish to execute different procedures under multiple conditions. This can be controlled in R using else if() following an initial if() statement. ```r x <- 5 if(x > 5){ message(x," is greater than 5") }else if(x == 5){ message(x," is 5") }else{ message(x, " is less than 5") } ``` ``` ## 5 is 5 ``` --- ## ifelse() A useful function to evaluate conditional statements over vectors is the **ifelse()** function. ```r x <- 1:10 x ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 ``` The ifelse() function takes three arguments: the condition to evaluate, the value to return when the condition is true, and the value to return when the condition is false. ```r ifelse(x <= 3,"lessOrEqual","more") ``` ``` ## [1] "lessOrEqual" "lessOrEqual" "lessOrEqual" "more" "more" ## [6] "more" "more" "more" "more" "more" ``` --- ## ifelse() We can nest multiple **ifelse()** calls to apply more complex logic to vectors. ```r ifelse(x == 3,"same", ifelse(x < 3,"less","more") ) ``` ``` ## [1] "less" "less" "same" "more" "more" "more" "more" "more" "more" "more" ``` --- ## Loops The two main generic methods of looping in R are **while** and **for** - **while** - *while* loops repeat the execution of code while a condition evaluates as true. - **for** - *for* loops repeat the execution of code for a range of specified values. 
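As a rough illustration of the difference between the two loop types, the sketch below runs a for loop over a fixed range and a while loop until a condition fails. The variable names (`squares`, `value`, `halvings`) and the numbers are made up for this example.

```r
# for loop: the number of iterations is fixed by the vector we loop over.
squares <- numeric(0)
for (i in 1:4) {
  squares <- c(squares, i^2)
}
squares

# while loop: keep repeating until the condition becomes FALSE.
# Here we count how many times 100 must be halved before it drops below 1.
value <- 100
halvings <- 0
while (value >= 1) {
  value <- value / 2
  halvings <- halvings + 1
}
halvings
```

A for loop suits cases where the values to iterate over are known in advance, whereas a while loop suits cases where the stopping point depends on something computed inside the loop.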
--- ## For loops For loops allow the user to cycle through a range of values, applying an operation for every value. Here we cycle through a numeric vector and print out each value. ```r x <- 1:5 for(i in x){ message(i," ", appendLF = F) } ``` ``` ## 1 2 3 4 5 ``` Similarly we can cycle through other vector types (or lists). ```r x <- toupper(letters[1:5]) for(i in x){ message(i," ", appendLF = F) } ``` ``` ## A B C D E ``` --- ## Looping through indices We may wish to keep track of the position in x we are evaluating to retrieve the same index in other variables. A common practice is to loop through all possible index positions of x using the expression **1:length(x)**. ```r geneName <- c("Ikzf1","Myc","Igll1") expression <- c(10.4,4.3,6.5) 1:length(geneName) ``` ``` ## [1] 1 2 3 ``` ```r for(i in 1:length(geneName)){ message(geneName[i]," has an RPKM of ",expression[i]) } ``` ``` ## Ikzf1 has an RPKM of 10.4 ``` ``` ## Myc has an RPKM of 4.3 ``` ``` ## Igll1 has an RPKM of 6.5 ``` --- ## Loops and conditionals Loops can be combined with conditional statements to allow for complex control of their execution over R objects. .pull-left[ ```r x <- 1:13 for(i in 1:13){ if(i > 10){ message("Number ",i," is greater than 10") }else if(i == 10){ message("Number ",i," is 10") }else{ message("Number ",i," is less than 10") } } ``` ] .pull-right[ ``` ## Number 1 is less than 10 ``` ``` ## Number 2 is less than 10 ``` ``` ## Number 3 is less than 10 ``` ``` ## Number 4 is less than 10 ``` ``` ## Number 5 is less than 10 ``` ``` ## Number 6 is less than 10 ``` ``` ## Number 7 is less than 10 ``` ``` ## Number 8 is less than 10 ``` ``` ## Number 9 is less than 10 ``` ``` ## Number 10 is 10 ``` ``` ## Number 11 is greater than 10 ``` ``` ## Number 12 is greater than 10 ``` ``` ## Number 13 is greater than 10 ``` ] --- ## Functions to loop over data types There are functions which allow you to loop over a data type and apply a function to each element or subsection of that data. 
- **apply** - Apply function to rows or columns of a matrix/data frame and return results as a vector,matrix or list. - **lapply** - Apply function to every element of a vector or list and return results as a list. - **sapply** - Apply function to every element of a vector or list and return results as a vector,matrix or list. --- ## sapply() **sapply** (*smart apply*) acts as lapply but attempts to return the results as the most appropriate data type. Here sapply returns a vector where lapply would return lists. ```r exampleVector <- c(1,2,3,4,5) exampleList <- list(1,2,3,4,5) sapply(exampleVector, mean, na.rm=T) ``` ``` ## [1] 1 2 3 4 5 ``` ```r sapply(exampleList, mean, na.rm=T) ``` ``` ## [1] 1 2 3 4 5 ``` --- ## sapply() example .pull-left[ In this example lapply returns a list of vectors from the quantile function. ```r exampleList <- list(row1=1:5, row2=6:10, row3=11:15) exampleList ``` ``` ## $row1 ## [1] 1 2 3 4 5 ## ## $row2 ## [1] 6 7 8 9 10 ## ## $row3 ## [1] 11 12 13 14 15 ``` ] .pull-right[ ```r lapply(exampleList, quantile) ``` ``` ## $row1 ## 0% 25% 50% 75% 100% ## 1 2 3 4 5 ## ## $row2 ## 0% 25% 50% 75% 100% ## 6 7 8 9 10 ## ## $row3 ## 0% 25% 50% 75% 100% ## 11 12 13 14 15 ``` ] --- ## sapply() example 2 Here is an example of sapply parsing a result from the quantile function in a *smart* way. When a function always returns a vector of the same length, sapply will create a matrix with elements by column. ```r sapply(exampleList, quantile) ``` ``` ## row1 row2 row3 ## 0% 1 6 11 ## 25% 2 7 12 ## 50% 3 8 13 ## 75% 4 9 14 ## 100% 5 10 15 ``` --- ## sapply() example 3 When sapply cannot parse the result to a vector or matrix, a list will be returned. ```r exampleList <- list(df=data.frame(sample=paste0("patient",1:2), data=c(1,12)), vec=c(1,3,4,5)) sapply(exampleList, summary) ``` ``` ## $df ## sample data ## Length:2 Min. : 1.00 ## Class :character 1st Qu.: 3.75 ## Mode :character Median : 6.50 ## Mean : 6.50 ## 3rd Qu.: 9.25 ## Max. 
:12.00 ## ## $vec ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.00 2.50 3.50 3.25 4.25 5.00 ``` --- class: inverse, center, middle # Plotting <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Base plotting The base plotting function *plot* does a good job of guessing what kind of plot you might want. The output varies depending on the data type of your input. ```r plot(merge_df[,c(4,5)]) ``` --- ## Base plotting Many plot types work well with factors. Factors help the plotting function divide your data into categories. Here we try with a regular character vector. ```r plot(merge_df[,3]) ``` ``` ## Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion ``` ``` ## Warning in min(x): no non-missing arguments to min; returning Inf ``` ``` ## Warning in max(x): no non-missing arguments to max; returning -Inf ``` ``` ## Error in plot.window(...) : need finite 'ylim' values ``` --- ## Base plotting Now we try with a factor. ```r merge_df[,3] <- factor(merge_df[,3]) plot(merge_df[,3]) ``` <img src="introToR_Session2_files/figure-html/unnamed-chunk-51-1.png" width="900px" /> --- ## Base plotting There are also some functions for making specific plots, like boxplot. ```r boxplot(Expression ~ pathway, merge_df) ``` --- ## Beyond Base plots For more advanced plots we recommend you check out our ggplot2 training. [R graph gallery](https://r-graph-gallery.com/index.html) is also a really useful website that has example plots and the code used to generate them. .pull-left[ ```r library(ggplot2) ggplot(merge_df, aes(x=pathway, y=Expression, fill=pathway))+ geom_violin()+ geom_jitter(width=0.1)+ theme_linedraw()+ ggtitle("Gene Expression in Glycolysis and TGFb pathways") ``` ] .pull-right[ ] --- ## Time for an exercise! 
The exercise for this session can be found [here](../../exercises/exercises/conditionsAndLoops_exercise.html) --- ## Answers to exercise Answers can be found [here](../../exercises/answers/conditionsAndLoops_answers.html) --- ## What we didn't cover * [Matrices](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Matrices) * [While loops](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session2.html#While_loops) * [Coercing data types](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Coercing_data_formats) * [Custom functions](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session2.html#Custom_functions) * [Making scripts](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session2.html#Scripts) --- ## Getting help - From us: Raise an issue on our [GitHub](https://github.com/RockefellerUniversity/RU_introtoR_abridged/issues). Issues can be suggestions, comments, edits or questions (about content or the slides themselves). - Google - Local friendly bioinformaticians and computational biologists. - [Stackoverflow](http://stackoverflow.com/) - [R-help](https://stat.ethz.ch/mailman/listinfo/r-help)