class: center, middle, inverse, title-slide .title[ # Introduction to R - Abridged, Session 1
] .author[ ### Rockefeller University, Bioinformatics Resource Centre ] .date[ ###
http://rockefelleruniversity.github.io/RU_introtoR_abridged/
]

---

## Overview

- [Set up](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session1.html#set-up)
- [Background to R](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session1.html#background-to-r)
- [Data types in R](https://rockefelleruniversity.github.io/RU_introtoR_abridged/presentations/singlepage/introToR_Session1.html#data_types_in_r)

---

## For More

- This is an abridged version of our full [Intro To R](https://rockefelleruniversity.github.io/Intro_To_R_1Day/) course.
- You can also find videos of us reviewing this course material and other courses on our [YouTube channel](https://www.youtube.com/channel/UCemRwott-YnMt6A2ukrRUdg).
- If you have specific questions, please post an issue on our GitHub [here](https://github.com/RockefellerUniversity/RU_introtoR_abridged/issues).

---
class: inverse, center, middle

# Set Up

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## Materials

All prerequisites, links to material and slides for this course can be found on GitHub.

* [Intro_To_R](https://rockefelleruniversity.github.io/RU_introtoR_abridged/)

Or they can be downloaded as a zip archive from here.

* [Download zip](https://github.com/rockefelleruniversity/RU_introtoR_abridged/zipball/master)

---

## Course materials

Once the zip file is unarchived, all presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

* **presentations/slides/** Presentations as an HTML slide show.
* **presentations/singlepage/** Presentations as an HTML single page.
* **presentations/r_code/** R code in presentations.
* **exercises/** Practicals as HTML pages.
* **answers/** Practicals with answers as HTML pages and R code solutions.

---
class: inverse, center, middle

# What is R?

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## What is R?
**R** is a scripting language and environment for **statistical computing**.

It was developed by [Robert Gentleman](https://en.wikipedia.org/wiki/Robert_Gentleman_%28statistician%29) and [Ross Ihaka](https://en.wikipedia.org/wiki/Ross_Ihaka) and inherits much from **S** (Bell Labs).

- Suited to high level data analysis
- Open source & cross platform
- Extensive graphics capabilities
- Diverse range of add-on packages
- Active community of developers
- Thorough documentation

---

## What is R to you?

.pull-left[
**R** comes with excellent "out-of-the-box" statistical and plotting capabilities.

**R** provides access to thousands of packages ([CRAN](http://cran.r-project.org/)/[MRAN](http://mran.revolutionanalytics.com/)/[R-forge](https://r-forge.r-project.org/)) which extend the basic functionality of R while maintaining high quality documentation.

In particular, [Robert Gentleman](https://en.wikipedia.org/wiki/Robert_Gentleman_%28statistician%29) developed the **[Bioconductor](http://bioconductor.org/)** project, where hundreds of packages are directly related to computational biology and the analysis of associated high-throughput experiments.
]

.pull-right[
![R use over time](imgs/RCitations.jpeg)
]

---

## How to get R?

.pull-left[
R is freely available from the [R-project website](http://cran.ma.imperial.ac.uk/).
RStudio provides an integrated development environment (IDE) which is freely available from the [RStudio site](http://www.rstudio.com/).

***We will be using RStudio and R (hopefully) already installed on your machines.***
]

.pull-right[
![R website](imgs/cran.jpeg)
![RStudio website](imgs/rstudio.jpeg)
]

---

## A quick tour of RStudio

.pull-left[
Four main panels

- Scripting panel
- R interface
- Environment and history
- Files, directories and help

**Let's load RStudio and take a look**
]

.pull-right[
![RStudio appearance](imgs/rstudioBlank.jpeg)
]

---
class: inverse, center, middle

# Data Types in R

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## Different Data types in R

- Simple calculations
- Variables
- Vectors
- Matrices (we will not cover these)
- Data frames
- Lists

---

## Simple Calculations

At its most basic, **R** can be used as a simple calculator.

```r
3+1
```

```
## [1] 4
```

```r
2*2
```

```
## [1] 4
```

```r
sqrt(25)-1
```

```
## [1] 4
```

---

## Using functions

The call **sqrt(25)** demonstrates the use of functions in R. A function performs a complex operation on its arguments and returns the result.

In R, arguments are provided to a function within the parentheses, **( )**, that follow the function name. So **sqrt(*ARGUMENT*)** will provide the square root of the value of ***ARGUMENT***.

Other examples of functions include **min()**, **sum()** and **max()**.

Note that multiple arguments are separated by commas.

```r
min(2, 4, 6)
```

```
## [1] 2
```

```r
sum(2, 4, 6)
```

```
## [1] 12
```

```r
max(2, 4, 6)
```

```
## [1] 6
```

---

## Using functions

R has many useful functions "built in" and ready to use as soon as R is loaded.

An incomplete, illustrative list can be seen [here](http://www.statmethods.net/management/functions.html).

In addition to R's standard functions, additional functionality can be loaded into R using libraries.
These include specialized tools for areas such as sequence alignment, read counting etc.

If you need to see how a function works, try **?** in front of the function name.

```r
?sqrt
```

Let's run [**?sqrt**](https://stat.ethz.ch/R-manual/R-devel/library/base/html/MathFun.html) in RStudio and look at the help.

---

## Using functions

**Arguments have names and order**

With functions such as min() and sqrt(), the arguments to be provided are obvious, and for min() the order of these arguments doesn't matter.

```r
min(5, 4, 6)
```

```
## [1] 4
```

```r
min(6, 4, 5)
```

```
## [1] 4
```

Many functions, however, have an order to their arguments. Try looking at the arguments for the dir() function using [?dir](https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.files.html).

```r
?dir
```

---

## Using functions

**Setting names for arguments**

Often we know the names of arguments but not necessarily their order. In cases where we want to be sure we specify the right argument, we provide names for the arguments used.

```r
dir()
dir(full.names=TRUE)
```

This also means we don't have to copy out all the defaults for the arguments preceding it.

```r
dir(full.names=TRUE)
# Is equivalent to...
dir(".", NULL, FALSE, TRUE)
```

---
class: inverse, center, middle

# Variables

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## Variables

As with other programming languages and even graphing calculators, **R** makes use of **variables**.

A **variable** stores a value under a name, which can be a letter or a word.

In **R**, we make use of the assignment operator **<-**

```r
x <- 10
```

Now **x** holds the value of 10.

```r
x
```

```
## [1] 10
```

---

## Altering variables

```r
x
```

```
## [1] 10
```

Variables can be altered in place.

```r
x <- 20
x
```

```
## [1] 20
```

---

## Using variables

Variables can be used just as the values they contain.
```r
x
```

```
## [1] 20
```

```r
x + sqrt(25)
```

```
## [1] 25
```

Variables can be used to create new variables.

```r
y <- x + sqrt(25)
y
```

```
## [1] 25
```

---
class: inverse, center, middle

# Vectors

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## Vectors

In **R** the most basic variable or data type is a **vector**. A vector is an ordered collection of values. The x and y variables we have previously assigned are examples of vectors of length 1.

```r
x
```

```
## [1] 20
```

```r
length(x)
```

```
## [1] 1
```

---

## Vectors

To create a vector holding multiple values we use the function **c()** to *combine* the supplied arguments into one vector.

```r
x <- c(1,2,3,4,5,6,7,8,9,10)
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

```r
length(x)
```

```
## [1] 10
```

---

## Vectors

Vectors of consecutive values can be created using a colon (**:**) as a shortcut.

```r
y <- 6:10
y
```

```
## [1]  6  7  8  9 10
```

---

## Indexing

Square brackets **[]** identify the position within a vector (the **index**).

These indices can be used to extract relevant values from vectors.

NOTE: The vector below is made of text rather than numbers. This is called a character vector.

```r
z <- c("a","b","c","d","e","f")
z
```

```
## [1] "a" "b" "c" "d" "e" "f"
```

```r
z[1]
```

```
## [1] "a"
```

```r
z[4]
```

```
## [1] "d"
```

---

## Indexing

Indices can be used to extract values from multiple positions within a vector.

```r
z[c(1,4)]
```

```
## [1] "a" "d"
```

Negative indices can be used to extract all positions except those specified.

```r
z[-5]
```

```
## [1] "a" "b" "c" "d" "f"
```

---

## Indexing and replacement

We can use indices to modify a specific position in a vector.

```r
z
```

```
## [1] "a" "b" "c" "d" "e" "f"
```

```r
z[5] <- "Hello"
z
```

```
## [1] "a"     "b"     "c"     "d"     "Hello" "f"
```

---

## Indexing and replacement

Indices can be specified using other vectors.
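As a minimal sketch of this idea, the snippet below replaces two positions of a small vector in one step using an index vector. The names `letterVec` and `positions` are hypothetical and are not part of the course material.

```r
# Hypothetical example vector and index vector (illustration only)
letterVec <- c("a","b","c","d","e")
positions <- c(2,4)

# Replace the values at positions 2 and 4 in a single assignment
letterVec[positions] <- "X"

# letterVec now holds "a" "X" "c" "X" "e"
letterVec
```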
```r
y
```

```
## [1]  6  7  8  9 10
```

```r
z[y] <- "Hello again"
z
```

```
##  [1] "a"           "b"           "c"           "d"           "Hello"      
##  [6] "Hello again" "Hello again" "Hello again" "Hello again" "Hello again"
```

Because **z** had only six elements, assigning to positions 7 to 10 extends the vector.

---

## Remember!

Square brackets **[]** for indexing.

```r
x[1]
```

```
## [1] 1
```

Parentheses **()** for function arguments.

```r
sqrt(4)
```

```
## [1] 2
```

---

## Arithmetic operations

Vectors can also be used in arithmetic operations. When a standard arithmetic operation is applied to a vector, the operation is applied to each position in the vector.

```r
x <- c(1,2,3,4,5,6,7,8,9,10)
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

```r
y <- x*2
y
```

```
##  [1]  2  4  6  8 10 12 14 16 18 20
```

Multiple vectors can be used within arithmetic operations.

```r
x+y
```

```
##  [1]  3  6  9 12 15 18 21 24 27 30
```

---

## Arithmetic operations

When applying an arithmetic operation between two vectors of unequal length, the shorter vector is recycled (repeated) to match the length of the longer one. If the longer length is not a multiple of the shorter length, R still returns a result but issues a warning.

```r
x <- c(1,2,3,4,5,6,7,8,9,10)
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

```r
x+c(1,2)
```

```
##  [1]  2  4  4  6  6  8  8 10 10 12
```

```r
x+c(1,2,3)
```

```
## Warning in x + c(1, 2, 3): longer object length is not a multiple of shorter
## object length
```

```
##  [1]  2  4  6  5  7  9  8 10 12 11
```

---
class: inverse, center, middle

# Factors

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## Creating factors

A special case of a vector is a **factor**.

Factors are used to store data which may be grouped in categories (categorical data). Specifying data as categorical allows R to properly handle the data and make use of functions specific to categorical data.

To create a factor from a vector we use the **factor()** function. Note that the factor now has an additional component called **"levels"** which identifies all categories within the vector.
```r
vectorExample <- c("male","female","female","female")
factorExample <- factor(vectorExample)
factorExample
```

```
## [1] male   female female female
## Levels: female male
```

```r
levels(factorExample)
```

```
## [1] "female" "male"
```

---

## summary() function

An example of the use of levels can be seen from applying the **summary()** function to the vector and factor examples.

```r
summary(vectorExample)
```

```
##    Length     Class      Mode 
##         4 character character
```

```r
summary(factorExample)
```

```
## female   male 
##      3      1
```

---

## Display order of levels

In our factor example, the levels have been displayed in alphabetical order. To adjust the display order of levels in a factor, we can supply the desired display order to the **levels** argument in the **factor()** function call.

```r
factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample
```

```
## [1] male   female female female
## Levels: male female
```

```r
summary(factorExample)
```

```
##   male female 
##      1      3
```

---

## Nominal factors

In some cases there is no natural order to the categories such that one category is greater than the other (nominal data). Factors are treated as nominal by default, so comparing their elements with **<** is not meaningful.

```r
factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample[1] < factorExample[2]
```

```
## Warning in Ops.factor(factorExample[1], factorExample[2]): '<' not meaningful
## for factors
```

```
## [1] NA
```

---

## Ordinal factors

In other cases there will be a natural ordering to the categories (ordinal data). A factor can be specified to be ordered using the **ordered** argument in combination with the **levels** argument.

```r
factorExample <- factor(c("small","big","big","small"),
                        ordered=TRUE,levels=c("small","big"))
factorExample
```

```
## [1] small big   big   small
## Levels: small < big
```

```r
factorExample[1] < factorExample[2]
```

```
## [1] TRUE
```

---

## Replacement

Unlike vectors, replacing elements within a factor isn't so easy.
While replacing one element with an established level is possible, replacing with a novel element will result in a warning and an NA value.

```r
factorExample <- factor(c("small","big","big","small"))
factorExample[1] <- c("big")
factorExample
```

```
## [1] big   big   big   small
## Levels: big small
```

```r
factorExample[1] <- c("huge")
```

```
## Warning in `[<-.factor`(`*tmp*`, 1, value = "huge"): invalid factor level, NA
## generated
```

```r
factorExample
```

```
## [1] <NA>  big   big   small
## Levels: big small
```

---

## Replacement

To add a new level we can use the **levels()** function.

```r
levels(factorExample) <- c("big","small","huge")
factorExample[1] <- c("huge")
factorExample
```

```
## [1] huge  big   big   small
## Levels: big small huge
```

---
class: inverse, center, middle

# Data Frames

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

Most programming languages have a concept of a table. In **R**, the most useful type is a **data frame**.

<div align="center">
<img src="imgs/ExcelMat.jpg" alt="igv" height="300" width="600">
</div>

---

## Creating data frames

In R, we make use of the **data frame** object, which allows us to store tables with columns of different data types. To create a data frame we can simply use the **data.frame()** function.

```r
patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName, Type=patientType, Survival_Time=survivalTime)
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

---

## Indexing

Selecting and replacing portions of a data frame can be done by **indexing** using square brackets **[]**, much like for vectors.

When indexing data frames, two values may be provided within the square brackets, separated by a comma, to retrieve information on a data frame position.
The first value(s) correspond to row(s) and the second to column(s).

- ***myDataFrame[rowOfInterest,columnOfInterest]***

---

## Indexing

```r
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

Value of first column, second row

```r
dfExample[2,1]
```

```
## [1] "patient2"
```

---

## Indexing

Similarly, whole rows or columns can be extracted. A single column is returned as a vector (here a factor), while a single row is returned as a one-row data frame.

Values of second column (row index is empty!)

```r
dfExample[,2]
```

```
## [1] male   female male   female
## Levels: female male
```

Values of third row (column index is empty!)

```r
dfExample[3,]
```

```
##       Name Type Survival_Time
## 3 patient3 male             2
```

---

## Indexing

When multiple column or row indices are specified, a new data frame is returned.

Values of second and third row (column index is empty!)

```r
dfExample[c(2,3),]
```

```
##       Name   Type Survival_Time
## 2 patient2 female            30
## 3 patient3   male             2
```

---

## Indexing and replacement

Replacement can occur in the same way we have seen for less complex data types: use the assignment operator *<-*. Note that assigning a character value into a numeric column converts the whole column to character.

```r
dfExample[1,3] <- "Forever"
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male       Forever
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

---

## Indexing and replacement

When we work with factors, a replacement value has to be one of the existing levels to be successful.

```r
dfExample[1,"Type"] <- "other"
```

```
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "other"): invalid factor level,
## NA generated
```

```r
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   <NA>       Forever
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

---

## Factors in data frames

It is possible to update factors in data frames just as with standard factors.
```r
dfExample <- data.frame(Name=patientName,Type=patientType,
                        Survival_Time=survivalTime)

levels(dfExample[,"Type"]) <- c(levels(dfExample[,"Type"]), "other")

dfExample[1,"Type"] <- "other"
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1  other             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

---

## Specify columns with $

It is also possible to index a data frame's column by using the **$** symbol.

```r
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)

dfExample$Survival_Time
```

```
## [1]  1 30  2 20
```

---

## Create columns with $

The **$** operator also allows for the creation of new columns for a data frame on the fly.

```r
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

```r
dfExample$newColumn <- rep("newData",nrow(dfExample))
dfExample
```

```
##       Name   Type Survival_Time newColumn
## 1 patient1   male             1   newData
## 2 patient2 female            30   newData
## 3 patient3   male             2   newData
## 4 patient4 female            20   newData
```

---

## Finding dimensions

To find the dimensions of a data frame, the **dim()** function will provide the number of rows followed by the number of columns, while **nrow()** and **ncol()** will return just the number of rows and the number of columns respectively.

```r
dim(dfExample)
```

```
## [1] 4 4
```

```r
nrow(dfExample)
```

```
## [1] 4
```

```r
ncol(dfExample)
```

```
## [1] 4
```

---

## Names

The functions *colnames()* and *rownames()* can be used to interact with the names. We can access them by simply using the function or update them using assignment.
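The same access-and-assign pattern can be sketched for row names. This is a minimal illustration using a small hypothetical data frame; `demoDF` and its `Score` column are assumptions and are not part of the course material.

```r
# Hypothetical data frame used only for illustration
demoDF <- data.frame(Score=c(10,20,30))

# Access the current row names (by default the strings "1", "2", "3")
rownames(demoDF)

# Update the row names by assignment
rownames(demoDF) <- c("sampleA","sampleB","sampleC")
demoDF

# Row names can then be used as row indices; this returns 20
demoDF["sampleB","Score"]
```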
```r
colnames(dfExample)
```

```
## [1] "Name"          "Type"          "Survival_Time" "newColumn"
```

```r
colnames(dfExample)[1] <- "PatientID"
dfExample
```

```
##   PatientID   Type Survival_Time newColumn
## 1  patient1   male             1   newData
## 2  patient2 female            30   newData
## 3  patient3   male             2   newData
## 4  patient4 female            20   newData
```

---

## Joining vectors and data frames

A data frame can be created from multiple vectors or other data frames.

**cbind()** can be used to attach data to a data frame as columns.

```r
x <- 1:4
newDF <- cbind(x,dfExample)
newDF
```

```
##   x PatientID   Type Survival_Time newColumn
## 1 1  patient1   male             1   newData
## 2 2  patient2 female            30   newData
## 3 3  patient3   male             2   newData
## 4 4  patient4 female            20   newData
```

---

## Joining vectors and data frames

**rbind()** can be used to attach data to a data frame as rows. Here **z** has four values but **newDF** has five columns, so R recycles **z** to fill the last column (note the 5 under newColumn) and issues a warning.

```r
z <- c(5,"patient5","male",25)
newerDF <- rbind(newDF,z)
```

```
## Warning in rbind(deparse.level, ...): number of columns of result, 5, is not a
## multiple of vector length 4 of arg 2
```

```r
newerDF
```

```
##   x PatientID   Type Survival_Time newColumn
## 1 1  patient1   male             1   newData
## 2 2  patient2 female            30   newData
## 3 3  patient3   male             2   newData
## 4 4  patient4 female            20   newData
## 5 5  patient5   male            25         5
```

---
class: inverse, center, middle

# Lists

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>

---

## Creating lists

Lists are the final data type we will look at.

In R, lists provide a general container which can hold any data type.

```r
firstElement <- 10
secondElement <- c("a","b","c","d")
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))
```

---

## Creating lists

To create a list we can simply use the **list()** function with arguments specifying the data we wish to include in the list.
```r
myList <- list(firstElement, secondElement, thirdElement)
myList
```

```
## [[1]]
## [1] 10
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
##   colOne colTwo
## 1      1    One
## 2      2    Two
## 3      4  Three
## 4      5   Four
```

---

## Indexing

Lists, as with other data types in R, can be indexed. In contrast to other types, using **[]** on a list will subset the list to another list of the selected indices. To retrieve an element from a list in R, two square brackets **[[]]** must be used.

```r
myList <- list(firstElement,secondElement,thirdElement)
myList[1]
```

```
## [[1]]
## [1] 10
```

```r
myList[[1]]
```

```
## [1] 10
```

---

## Joining lists

Again, similar to vectors, lists can be joined together in R using the c() function.

```r
myNamedList <- list(First=firstElement,Second=secondElement,
                    Third=thirdElement)
myNamedList <- c(myNamedList,list(fourth=c(4,4)))
myNamedList[c(1,4)]
```

```
## $First
## [1] 10
## 
## $fourth
## [1] 4 4
```

---

## Flattening lists

Sometimes you will wish to "flatten" out a list. When a list contains compatible objects, i.e. elements that are all of one type, the **unlist()** function can be used.

```r
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
myNamedList
```

```
## $First
## [1] 1 2 3
## 
## $Second
## [1] 2 6 7
## 
## $Third
## [1] 1 4 7
```

```r
flatList <- unlist(myNamedList)
flatList[1:7]
```

```
##  First1  First2  First3 Second1 Second2 Second3  Third1 
##       1       2       3       2       6       7       1
```

---

## Time for an exercise!

Exercises on what we have covered can be found [here](../../exercises/exercises/data_types_exercise.html)

---

## Answers to exercise

Answers can be found [here](../../exercises/answers/data_types_answers.html)