All prerequisites, links to material and slides for this course can be found on GitHub.
They can also be downloaded as a zip archive from here.
Once the zip file is unarchived, all presentations (as HTML slides and pages), their R code and the HTML practical sheets will be available in the directories underneath.
R is a scripting language and environment for statistical computing.
It was developed by Robert Gentleman and Ross Ihaka.
It inherits much from S (Bell Labs).
R comes with excellent “out-of-the-box” statistical and plotting capabilities.
R provides access to 1000s of packages (CRAN/MRAN/R-forge) which extend the basic functionality of R while maintaining high quality documentation.
In particular, Robert Gentleman developed the Bioconductor project, where 100s of packages are directly related to computational biology and the analysis of associated high-throughput experiments.
R is freely available from the R-project website.
RStudio provides an integrated development environment (IDE), which is freely available from the RStudio site.
We will be using RStudio and R (hopefully) already installed on your machines.
RStudio has four main panels:
- Scripting panel
- R interface
- Environment and history
- Files, directories and help
Let’s load RStudio and take a look
At its most basic, R can be used as a simple calculator.
> 3+1
## [1] 4
> 2*2
## [1] 4
> sqrt(25)-1
## [1] 4
The sqrt(25) call demonstrates the use of functions in R. A function performs an operation on its arguments and returns the result.
In R, arguments are provided to a function within the parentheses, ( ), that follow the function name. So sqrt(ARGUMENT) will provide the square root of the value of ARGUMENT.
Other examples of functions include min(), sum(), max().
Note multiple arguments are separated by a comma.
min(2, 4, 6)
## [1] 2
sum(2, 4, 6)
## [1] 12
max(2, 4, 6)
## [1] 6
R has many useful functions “built in” and ready to use as soon as R is loaded.
An incomplete, illustrative list can be seen here
In addition to R standard functions, additional functionality can be loaded into R using libraries. These include specialized tools for areas such as sequence alignment, read counting etc.
If you need to see how a function works try ? in front of the function name.
?sqrt
Let's run ?sqrt in RStudio and look at the help.
Arguments have names and order
With functions such as min() and sqrt(), the arguments to be provided are obvious and the order of these arguments doesn't matter.
min(5, 4, 6)
## [1] 4
min(6, 4, 5)
## [1] 4
Many functions however have an order to their arguments. Try and look at the arguments for the dir() function using ?dir.
?dir
Setting names for arguments
Often we know the names of arguments but not necessarily their order. In cases where we want to be sure we specify the right argument, we provide names for the arguments used.
dir()
dir(full.names=TRUE)
This also means we don’t have to copy out all the defaults for the arguments preceding it.
dir(full.names=TRUE)
# Is equivalent to...
dir(".", NULL, FALSE, TRUE)
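As a further sketch of named versus positional arguments, the example below uses seq(), a base R function not discussed above, whose first arguments are from, to, by and length.out. The variable names are chosen purely for illustration.

```r
# Positional: arguments are matched in the order from, to, by
byPosition <- seq(1, 10, 2)

# Named: the same call, with the arguments given in a different order
byName <- seq(by = 2, to = 10, from = 1)
identical(byPosition, byName)

# Naming length.out lets us skip the by argument that precedes it
skipBy <- seq(1, 10, length.out = 4)
skipBy
```

Naming arguments makes a call longer, but it keeps working as intended even if you misremember the order in which a function declares them.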
As with other programming languages and even graphical calculators, R makes use of variables.
A variable stores a value as a letter or word.
In R, we make use of the assignment operator <-
x <- 10
Now x holds the value of 10
x
## [1] 10
x
## [1] 10
Variables can be altered in place
x <- 20
x
## [1] 20
Variables can be used just as the values they contain.
x
## [1] 20
x + sqrt(25)
## [1] 25
Variables can be used to create new variables
y <- x + sqrt(25)
y
## [1] 25
In R the most basic variable or data type is a vector. A vector is an ordered collection of values. The x and y variables we have previously assigned are examples of a vector of length 1.
x
## [1] 20
length(x)
## [1] 1
To create a multiple value vector we use the function c() to combine the supplied arguments into one vector.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
length(x)
## [1] 10
Vectors of continuous stretches of values can be created using a colon (:) as a shortcut.
y <- 6:10
y
## [1] 6 7 8 9 10
Square brackets [] identify the position within a vector (the index). These indices can be used to extract relevant values from vectors.
NOTE: This vector is not made of numbers. This is called a character vector.
z <- c("a","b","c","d","e","f")
z
## [1] "a" "b" "c" "d" "e" "f"
z[1]
## [1] "a"
z[4]
## [1] "d"
Indices can be used to extract values from multiple positions within a vector.
z[c(1,4)]
## [1] "a" "d"
Negative indices can be used to extract all positions except that specified.
z[-5]
## [1] "a" "b" "c" "d" "f"
We can use indices to modify a specific position in a vector.
z
## [1] "a" "b" "c" "d" "e" "f"
z[5] <- "Hello"
z
## [1] "a" "b" "c" "d" "Hello" "f"
Indices can be specified using other vectors.
y
## [1] 6 7 8 9 10
z[y] <- "Hello again"
z
## [1] "a" "b" "c" "d" "Hello"
## [6] "Hello again" "Hello again" "Hello again" "Hello again" "Hello again"
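The output above has ten elements although z started with six: assigning to positions beyond the end of a vector extends it. A minimal sketch of the same behaviour, using an illustrative vector called shortVec, shows that any skipped positions are filled with the missing value NA.

```r
shortVec <- c("a", "b", "c")
length(shortVec)

# Position 5 does not exist yet, so the vector grows to length 5
shortVec[5] <- "e"
shortVec
length(shortVec)

# Position 4 was never assigned, so it holds the missing value NA
is.na(shortVec[4])
```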
Square brackets [] for indexing.
x[1]
## [1] 1
Parentheses () for function arguments.
sqrt(4)
## [1] 2
Vectors can also be used in arithmetic operations. When a standard arithmetic operation is applied to a vector, the operation is applied to each position in the vector.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
y <- x*2
y
## [1] 2 4 6 8 10 12 14 16 18 20
Multiple vectors can be used within arithmetic operations.
x+y
## [1] 3 6 9 12 15 18 21 24 27 30
When applying an arithmetic operation between two vectors of unequal length, the shorter will be recycled.
x <- c(1,2,3,4,5,6,7,8,9,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
x+c(1,2)
## [1] 2 4 4 6 6 8 8 10 10 12
x+c(1,2,3)
## Warning in x + c(1, 2, 3): longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
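One way to picture recycling is that R repeats the shorter vector until it is as long as the longer one. The sketch below makes that repetition explicit with rep() and its length.out argument; the variable names are illustrative.

```r
x <- 1:10

# Implicit recycling of the shorter vector
recycled <- x + c(1, 2)

# The same calculation with the repetition written out
explicit <- x + rep(c(1, 2), length.out = length(x))

identical(recycled, explicit)
```

When the longer length is not a multiple of the shorter one, the final repetition is cut short, which is why R issues the warning shown above.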
A special case of a vector is a factor.
Factors are used to store data which may be grouped in categories (categorical data). Specifying data as categorical allows R to properly handle the data and make use of functions specific to categorical data.
To create a factor from a vector we use the factor() function. Note that the factor now has an additional component called “levels” which identifies all categories within the vector.
vectorExample <- c("male","female","female","female")
factorExample <- factor(vectorExample)
factorExample
## [1] male female female female
## Levels: female male
levels(factorExample)
## [1] "female" "male"
An example of the use of levels can be seen from applying the summary() function to the vector and factor examples
summary(vectorExample)
## Length Class Mode
## 4 character character
summary(factorExample)
## female male
## 3 1
In our factor example, the levels have been displayed in alphabetical order. To adjust the display order of levels in a factor, we can supply the desired display order to the levels argument in the factor() function call.
factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample
## [1] male female female female
## Levels: male female
summary(factorExample)
## male female
## 1 3
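It can help to picture a factor as a set of integer codes plus a lookup table of levels. The sketch below uses as.integer() and table(), base R functions not covered above, on an illustrative factor called sexFactor.

```r
sexFactor <- factor(c("male", "female", "female", "female"),
                    levels = c("male", "female"))

# Each element is stored as the position of its level in levels(sexFactor)
as.integer(sexFactor)

# table() counts the elements per level, reported in level order
table(sexFactor)
```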
In some cases there is no natural order to the categories, so no category is greater than another (nominal data). By default, R treats factors as unordered, so comparisons such as < are not meaningful.
factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample[1] < factorExample[2]
## Warning in Ops.factor(factorExample[1], factorExample[2]): '<' not meaningful
## for factors
## [1] NA
In other cases there will be a natural ordering to the categories (ordinal data). A factor can be specified to be ordered using the ordered argument in combination with specified levels argument.
factorExample <- factor(c("small","big","big","small"),
                        ordered=TRUE,levels=c("small","big"))
factorExample
## [1] small big big small
## Levels: small < big
factorExample[1] < factorExample[2]
## [1] TRUE
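As a small sketch of what the ordering makes possible, the example below builds an illustrative ordered factor with three levels and applies comparison, max() and sort() to it. The "medium" and "large" categories are made up for this example.

```r
sizes <- factor(c("medium", "small", "large", "small"),
                ordered = TRUE, levels = c("small", "medium", "large"))

# Comparisons follow the level order: medium is greater than small
sizes[1] > sizes[2]

# max() and sort() also respect the level order rather than the alphabet
max(sizes)
sort(sizes)
```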
Unlike vectors, replacing elements within a factor isn’t so easy. While replacing one element with an established level is possible, replacing with a novel element will result in a warning.
factorExample <- factor(c("small","big","big","small"))
factorExample[1] <- c("big")
factorExample
## [1] big big big small
## Levels: big small
factorExample[1] <- c("huge")
## Warning in `[<-.factor`(`*tmp*`, 1, value = "huge"): invalid factor level, NA
## generated
factorExample
## [1] <NA> big big small
## Levels: big small
To add a new level we can assign to the levels of the factor using the levels() function.
levels(factorExample) <- c("big","small","huge")
factorExample[1] <- c("huge")
factorExample
## [1] huge big big small
## Levels: big small huge
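Assigning to levels() relabels the existing levels by position, so retyping them in a different order would silently rename the data. A slightly safer sketch, using an illustrative factor called sizeFactor, appends the new level to the existing ones instead.

```r
sizeFactor <- factor(c("small", "big", "big", "small"))

# Append the new level to the existing ones rather than retyping them all
levels(sizeFactor) <- c(levels(sizeFactor), "huge")

# "huge" is now a valid level, so the replacement succeeds without a warning
sizeFactor[1] <- "huge"
sizeFactor
```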
All programming languages have a concept of a table. In R, the most useful type is a data frame.
The data frame object allows us to store tables with columns of different data types. To create a data frame we can simply use the data.frame() function.
patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName, Type=patientType, Survival_Time=survivalTime)
dfExample
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
Selecting and replacing portions of a data frame can be done by indexing using square brackets [] much like for vectors.
When indexing data frames, two values may be provided within the square brackets separated by a comma to retrieve information on a data frame position.
The first value(s) corresponds to row(s) and the second to column(s).
dfExample
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
Value of first column, second row
dfExample[2,1]
## [1] "patient2"
Similarly, whole rows or columns can be extracted. A single column is returned as a vector, while a single row is returned as a one-row data frame.
Values of second column (row index is empty!)
dfExample[,2]
## [1] male female male female
## Levels: female male
Values of third row (column index is empty!)
dfExample[3,]
##       Name Type Survival_Time
## 3 patient3 male             2
When multiple column or row indices are specified, a new data frame is returned.
Values of second and third row (column index is empty!)
dfExample[c(2,3),]
##       Name   Type Survival_Time
## 2 patient2 female            30
## 3 patient3   male             2
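The sketch below revisits the row and column indexing rules on a small illustrative data frame, dfSketch. It also shows two things not covered above: columns can be selected by name, and the drop argument of [ ] controls whether a single column is simplified to a vector.

```r
dfSketch <- data.frame(Name = c("patient1", "patient2", "patient3"),
                       Survival_Time = c(1, 30, 2))

# Second row, column selected by name instead of by position
dfSketch[2, "Survival_Time"]

# A single column is simplified to a vector by default...
is.data.frame(dfSketch[, "Survival_Time"])

# ...unless drop = FALSE is given, which keeps the data frame structure
is.data.frame(dfSketch[, "Survival_Time", drop = FALSE])
```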
Replacement can occur in the same way we have seen for less complex data types: use the assignment operator <-
dfExample[1,3] <- "Forever"
dfExample
##       Name   Type Survival_Time
## 1 patient1   male       Forever
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
When we work with factors, for a replacement to be successful it has to be a possible level.
dfExample[1,"Type"] <- "other"
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "other"): invalid factor level,
## NA generated
dfExample
##       Name   Type Survival_Time
## 1 patient1   <NA>       Forever
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
It is possible to update factors in data frames just as with standard factors.
dfExample <- data.frame(Name=patientName,Type=patientType,
                        Survival_Time=survivalTime)
levels(dfExample[,"Type"]) <- c(levels(dfExample[,"Type"]) ,
                                "other")
dfExample[1,"Type"] <- "other"
dfExample
##       Name   Type Survival_Time
## 1 patient1  other             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
It is also possible to index a data frame's column by using the $ symbol.
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
dfExample$Survival_Time
## [1] 1 30 2 20
The $ operator also allows for the creation of new columns for a data frame on the fly.
dfExample
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
dfExample$newColumn <- rep("newData",nrow(dfExample))
dfExample
##       Name   Type Survival_Time newColumn
## 1 patient1   male             1   newData
## 2 patient2 female            30   newData
## 3 patient3   male             2   newData
## 4 patient4 female            20   newData
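As a further sketch of the $ operator, the example below uses an illustrative data frame, dfSketch, to show that $ returns the same vector as double square brackets with the column name, and that a new column can be computed from an existing one. The column name Survival_Doubled is made up for this example.

```r
dfSketch <- data.frame(Name = c("patient1", "patient2"),
                       Survival_Time = c(1, 30))

# $ and [[ ]] with a column name return the same vector
identical(dfSketch$Survival_Time, dfSketch[["Survival_Time"]])

# A new column can be computed from an existing one
dfSketch$Survival_Doubled <- dfSketch$Survival_Time * 2
colnames(dfSketch)
dfSketch$Survival_Doubled
```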
To find dimensions of a data frame, the dim() function will provide dimensions as the row then column number while nrow() and ncol() will return just row number and column number respectively.
dim(dfExample)
## [1] 4 4
nrow(dfExample)
## [1] 4
ncol(dfExample)
## [1] 4
The functions colnames() and rownames() can be used to interact with the names. We can access them by simply using the function or update them using assignment.
colnames(dfExample)
## [1] "Name" "Type" "Survival_Time" "newColumn"
colnames(dfExample)[1] <- "PatientID"
dfExample
##   PatientID   Type Survival_Time newColumn
## 1  patient1   male             1   newData
## 2  patient2 female            30   newData
## 3  patient3   male             2   newData
## 4  patient4 female            20   newData
A data frame can be created from multiple vectors or other data frames.
cbind() can be used to attach data to a data frame as columns.
x <- 1:4
newDF <- cbind(x,dfExample)
newDF
##   x PatientID   Type Survival_Time newColumn
## 1 1  patient1   male             1   newData
## 2 2  patient2 female            30   newData
## 3 3  patient3   male             2   newData
## 4 4  patient4 female            20   newData
rbind() can be used to bind data to a data frame as rows.
z <- c(5,"patient5","male",25)
newerDF <- rbind(newDF,z)
## Warning in rbind(deparse.level, ...): number of columns of result, 5, is not a
## multiple of vector length 4 of arg 2
newerDF
##   x PatientID   Type Survival_Time newColumn
## 1 1  patient1   male             1   newData
## 2 2  patient2 female            30   newData
## 3 3  patient3   male             2   newData
## 4 4  patient4 female            20   newData
## 5 5  patient5   male            25         5
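The warning above arises because z holds four values while newDF has five columns, so the values are recycled (hence the 5 under newColumn). In addition, c() stores every value in z as a character string, which can turn numeric columns into character columns. One possible alternative, sketched below with an illustrative data frame, is to bind a one-row data frame whose column names match.

```r
dfSketch <- data.frame(ID = 1:2,
                       Name = c("patient1", "patient2"),
                       Survival_Time = c(1, 30))

# A one-row data frame with the same column names binds without recycling
newRow <- data.frame(ID = 3, Name = "patient3", Survival_Time = 25)
combined <- rbind(dfSketch, newRow)
combined

# Each value kept its own type, so the numeric column is still numeric
class(combined$Survival_Time)
```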
Lists are the final data type we will look at.
In R, lists provide a general container which can hold any data type.
firstElement <- 10
secondElement <- c("a","b","c","d")
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))
To create a list we can simply use the list() function with arguments specifying the data we wish to include in the list.
myList <- list(firstElement, secondElement, thirdElement)
myList
## [[1]]
## [1] 10
##
## [[2]]
## [1] "a" "b" "c" "d"
##
## [[3]]
##   colOne colTwo
## 1      1    One
## 2      2    Two
## 3      4  Three
## 4      5   Four
Lists, as with other data types in R, can be indexed. In contrast to other types, using [] on a list will subset the list to another list of the selected indices. To retrieve an element from a list in R, two square brackets [[]] must be used.
myList <- list(firstElement,secondElement,thirdElement)
myList[1]
## [[1]]
## [1] 10
myList[[1]]
## [1] 10
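The difference between the two kinds of bracket is easiest to see by checking the class of what comes back. A minimal sketch, using an illustrative list called sketchList:

```r
sketchList <- list(10, c("a", "b", "c"))

# Single brackets return a list of length 1
class(sketchList[1])

# Double brackets return the element itself
class(sketchList[[1]])

# So arithmetic works on the extracted element, not on the sub-list
sketchList[[1]] + 1
```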
Again, similar to vectors, lists can be joined together in R using the c() function.
myNamedList <- list(First=firstElement,Second=secondElement,
                    Third=thirdElement)
myNamedList <- c(myNamedList,list(fourth=c(4,4)))
myNamedList[c(1,4)]
## $First
## [1] 10
##
## $fourth
## [1] 4 4
Sometimes you will wish to “flatten” out a list. When a list contains compatible objects, i.e. list of all one type, the unlist() function can be used.
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
myNamedList
## $First
## [1] 1 2 3
##
## $Second
## [1] 2 6 7
##
## $Third
## [1] 1 4 7
flatList <- unlist(myNamedList)
flatList[1:7]
##  First1  First2  First3 Second1 Second2 Second3  Third1
##       1       2       3       2       6       7       1
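The requirement for compatible objects matters because a vector can hold only one type. As a sketch of what happens otherwise, the illustrative list below mixes numbers and text, and unlist() coerces everything to character.

```r
mixedList <- list(First = 1, Second = "two", Third = c(3, 4))
flatMixed <- unlist(mixedList)
flatMixed

# The numbers have been converted to character strings
class(flatMixed)

# Names are kept as-is for single values and numbered for longer elements
names(flatMixed)
```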
Exercises on what we have covered can be found here
Answers can be found here