Introduction to R (part 1)


For More

  • This is an abridged version of our full Intro To R course
  • You can also find videos of us reviewing this course material and other courses on our YouTube channel.
  • If you have specific questions, please post an issue on our GitHub here

Set Up


Materials

All prerequisites, links to material and slides for this course can be found on GitHub.

Alternatively, they can be downloaded as a zip archive from here.

Course materials

Once the zip file is unarchived, all presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

  • presentations/slides/ Presentations as an HTML slide show.
  • presentations/singlepage/ Presentations as an HTML single page.
  • presentations/r_code/ R code in presentations.
  • exercises/ Practicals as HTML pages.
  • answers/ Practicals with answers as HTML pages and R code solutions.

What is R?


What is R?

R is a scripting language and environment for statistical computing.

Developed by Robert Gentleman and Ross Ihaka.

Inheriting much from S (Bell labs).

  • Suited to high level data analysis
  • Open source & cross platform
  • Extensive graphics capabilities
  • Diverse range of add-on packages
  • Active community of developers
  • Thorough documentation

What is R to you?

R comes with excellent “out-of-the-box” statistical and plotting capabilities.

R provides access to 1000s of packages (CRAN/MRAN/R-forge) which extend the basic functionality of R while maintaining high quality documentation.

In particular, Robert Gentleman developed the Bioconductor project, where hundreds of packages are directly related to computational biology and the analysis of associated high-throughput experiments.

R use over time

How to get R?

Freely available from R-project website.

RStudio provides an integrated development environment (IDE) which is freely available from RStudio site

We will be using RStudio and R (hopefully) already installed on your machines.

A quick tour of RStudio

Four main panels

  • Scripting panel
  • R interface
  • Environment and history
  • Files, directories and help

Let’s load RStudio and take a look

RStudio appearance

Data Types in R


Different Data types in R

  • Simple calculations
  • Variables
  • Vectors
  • Matrices (we will not cover these)
  • Data frames
  • Lists

Simple Calculations

At its most basic, R can be used as a simple calculator.

> 3+1
## [1] 4
> 2*2
## [1] 4
> sqrt(25)-1
## [1] 4

Using functions

The sqrt(25) call demonstrates the use of functions in R. A function performs an operation on its arguments and returns the result.

In R, arguments are provided to a function within the parentheses ( ) that follow the function name. So sqrt(ARGUMENT) will provide the square root of the value of ARGUMENT.

Other examples of functions include min(), sum(), max().

Note multiple arguments are separated by a comma.

min(2, 4, 6)
## [1] 2
sum(2, 4, 6)
## [1] 12
max(2, 4, 6)
## [1] 6

Using functions

R has many useful functions “built in” and ready to use as soon as R is loaded.

An incomplete, illustrative list can be seen here

In addition to R standard functions, additional functionality can be loaded into R using libraries. These include specialized tools for areas such as sequence alignment, read counting etc.

If you need to see how a function works try ? in front of the function name.

?sqrt

Let’s run ?sqrt in RStudio and look at the help.

Using functions

Arguments have names and order

With functions such as min() and sqrt(), the arguments to be provided are obvious and the order of these arguments doesn’t matter.

min(5, 4, 6)
## [1] 4
min(6, 4, 5)
## [1] 4

Many functions however have an order to their arguments. Try and look at the arguments for the dir() function using ?dir.

?dir
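
As a complement to the help page, the argument names and their order can also be inspected directly in the console. The sketch below uses the base R functions args() and formals() on dir(); the variable name argNames is chosen here for illustration, and the full argument list may differ between R versions.

```r
# Print the argument names and defaults of dir() without opening the help page
args(dir)

# formals() returns the arguments as a named list; names() gives their order
argNames <- names(formals(dir))
argNames[1:4]
```

The fourth name returned is the full.names argument, which is why position matters when arguments are supplied without names.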

Using functions

Setting names for arguments

Often we know the names of arguments but not necessarily their order. In cases where we want to be sure we specify the right argument, we provide names for the arguments used.

dir()
dir(full.names=TRUE)

This also means we don’t have to copy out all the defaults for the arguments preceding it.

dir(full.names=TRUE)
# Is equivalent to...
dir(".", NULL, FALSE, TRUE)
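
The same idea can be tried with another base R function. The sketch below calls seq() with named arguments in two different orders; the variable names inOrder and shuffled are illustrative only.

```r
# Named arguments can be supplied in any order
inOrder <- seq(from = 1, to = 10, by = 3)
shuffled <- seq(by = 3, to = 10, from = 1)

inOrder
## [1]  1  4  7 10

# Both calls produce exactly the same vector
identical(inOrder, shuffled)
## [1] TRUE
```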

Variables


Variables

As with other programming languages and even graphical calculators, R makes use of variables.

A variable stores a value under a name, which can be a letter or a word.

In R, we make use of the assignment operator <-

x <- 10

Now x holds the value of 10

x
## [1] 10

Altering variables

x
## [1] 10

Variables can be altered in place

x <- 20
x
## [1] 20

Using variables

Variables can be used just as the values they contain.

x
## [1] 20
x + sqrt(25)
## [1] 25

Variables can be used to create new variables

y <- x + sqrt(25)
y
## [1] 25
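
Because a variable can appear on both sides of the assignment operator, it can also be updated from its own current value. A minimal sketch, using a hypothetical variable named counter:

```r
counter <- 1
# Read the current value, add 1, and store the result under the same name
counter <- counter + 1
counter
## [1] 2
```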

Vectors


Vectors

In R the most basic variable or data type is a vector. A vector is an ordered collection of values. The x and y variables we have previously assigned are examples of a vector of length 1.

x
## [1] 20
length(x)
## [1] 1

Vectors

To create a multiple value vector we use the function c() to combine the supplied arguments into one vector.

x <- c(1,2,3,4,5,6,7,8,9,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
length(x)
## [1] 10

Vectors

Vectors of continuous stretches of values can be created using a colon (:) as a shortcut.

y <- 6:10
y
## [1]  6  7  8  9 10
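
The colon shortcut also counts downwards, and the base R functions seq() and rep() cover steps other than 1 and repeated values. The following is a small sketch; the variable names are illustrative.

```r
# A colon with a larger value first counts down
countDown <- 10:6
countDown
## [1] 10  9  8  7  6

# seq() with by= gives every second value from 2 to 10
evens <- seq(2, 10, by = 2)
evens
## [1]  2  4  6  8 10

# rep() repeats the whole vector three times
repeated <- rep(c(1, 2), 3)
repeated
## [1] 1 2 1 2 1 2
```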

Indexing

Square brackets [] identify the position within a vector (the index). These indices can be used to extract relevant values from vectors.

NOTE: The vector z below is made of text rather than numbers. This is called a character vector.

z <- c("a","b","c","d","e","f")
z
## [1] "a" "b" "c" "d" "e" "f"
z[1]
## [1] "a"
z[4]
## [1] "d"

Indexing

Indices can be used to extract values from multiple positions within a vector.

z[c(1,4)]
## [1] "a" "d"

Negative indices can be used to extract all positions except that specified.

z[-5]
## [1] "a" "b" "c" "d" "f"
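
Ranges created with the colon shortcut and negative vectors can be used as indices in the same way. A short sketch on a separate, hypothetical vector named lettersExample:

```r
lettersExample <- c("a", "b", "c", "d", "e", "f")

# A range of positions
lettersExample[2:4]
## [1] "b" "c" "d"

# Everything except the first two positions
lettersExample[-c(1, 2)]
## [1] "c" "d" "e" "f"
```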

Indexing and replacement

We can use indices to modify a specific position in a vector.

z
## [1] "a" "b" "c" "d" "e" "f"
z[5] <- "Hello"
z
## [1] "a"     "b"     "c"     "d"     "Hello" "f"

Indexing and replacement

Indices can be specified using other vectors.

y
## [1]  6  7  8  9 10
z[y] <- "Hello again"
z
##  [1] "a"           "b"           "c"           "d"           "Hello"      
##  [6] "Hello again" "Hello again" "Hello again" "Hello again" "Hello again"
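
Assigning to positions beyond the end of a vector extends it, as seen above. If the assignment skips positions, R fills the gap with missing values (NA). A minimal sketch with a hypothetical vector named shortVector:

```r
shortVector <- c("a", "b")
# Positions 3 and 4 do not exist yet, so they are filled with NA
shortVector[5] <- "e"
shortVector
## [1] "a" "b" NA  NA  "e"
length(shortVector)
## [1] 5
```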

Remember!

Square brackets [] for indexing.

x[1]
## [1] 1

Parentheses () for function arguments.

sqrt(4)
## [1] 2

Arithmetic operations

Vectors can also be used in arithmetic operations. When a standard arithmetic operation is applied to a vector, the operation is applied to each position in the vector.

x <- c(1,2,3,4,5,6,7,8,9,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
y <- x*2
y
##  [1]  2  4  6  8 10 12 14 16 18 20

Multiple vectors can be used within arithmetic operations.

x+y
##  [1]  3  6  9 12 15 18 21 24 27 30

Arithmetic operations

When applying an arithmetic operation between two vectors of unequal length, the shorter will be recycled.

x <- c(1,2,3,4,5,6,7,8,9,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x+c(1,2)
##  [1]  2  4  4  6  6  8  8 10 10 12
x+c(1,2,3)
## Warning in x + c(1, 2, 3): longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11
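
Element-wise arithmetic means a whole formula can be applied to every value at once. The sketch below converts a few made-up temperatures from Celsius to Fahrenheit; the values and variable names are purely illustrative.

```r
tempsC <- c(0, 20, 37, 100)
# The single values 9, 5 and 32 are recycled across all four positions
tempsF <- tempsC * 9 / 5 + 32
tempsF
## [1]  32.0  68.0  98.6 212.0
```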

Factors


Creating factors

A special case of a vector is a factor.

Factors are used to store data which may be grouped in categories (categorical data). Specifying data as categorical allows R to properly handle the data and make use of functions specific to categorical data.

To create a factor from a vector we use the factor() function. Note that the factor now has an additional component called “levels” which identifies all categories within the vector.

vectorExample <- c("male","female","female","female")
factorExample <- factor(vectorExample)
factorExample
## [1] male   female female female
## Levels: female male
levels(factorExample)
## [1] "female" "male"
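
Internally, a factor stores each value as an integer code pointing into its levels. The sketch below, which uses a separate hypothetical factor named sexFactor, shows those codes with as.integer() and counts the levels with nlevels().

```r
sexFactor <- factor(c("male", "female", "female", "female"))

# Position of each value within levels(): "female" is 1, "male" is 2
as.integer(sexFactor)
## [1] 2 1 1 1

# Number of distinct categories
nlevels(sexFactor)
## [1] 2
```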

summary() function

An example of the use of levels can be seen from applying the summary() function to the vector and factor examples

summary(vectorExample)
##    Length     Class      Mode 
##         4 character character
summary(factorExample)
## female   male 
##      3      1

Display order of levels

In our factor example, the levels have been displayed in alphabetical order. To adjust the display order of levels in a factor, we can supply the desired display order to the levels argument in the factor() function call.

factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample
## [1] male   female female female
## Levels: male female
summary(factorExample)
##   male female 
##      1      3

Nominal factors

In some cases there is no natural order to the categories such that one category is greater than the other (nominal data). By default, R treats factors as nominal, so comparing two values with < is not meaningful.

factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample[1] < factorExample[2]
## Warning in Ops.factor(factorExample[1], factorExample[2]): '<' not meaningful
## for factors
## [1] NA
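
Although greater-than and less-than comparisons are not meaningful for nominal factors, testing for equality still works. A minimal sketch with a hypothetical factor named colours:

```r
colours <- factor(c("red", "blue", "red"))

# Equality comparisons are allowed for unordered factors
colours[1] == colours[2]
## [1] FALSE
colours[1] == colours[3]
## [1] TRUE
```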

Ordinal factors

In other cases there will be a natural ordering to the categories (ordinal data). A factor can be specified to be ordered using the ordered argument in combination with the levels argument.

factorExample <- factor(c("small","big","big","small"),
                        ordered=TRUE,levels=c("small","big"))
factorExample
## [1] small big   big   small
## Levels: small < big
factorExample[1] < factorExample[2]
## [1] TRUE
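
With an ordered factor, comparisons can be applied to the whole vector, and functions such as max() respect the level order. The sketch below assumes a hypothetical three-level factor named sizes.

```r
sizes <- factor(c("small", "big", "medium", "small"),
                ordered = TRUE, levels = c("small", "medium", "big"))

# Compare every element against one of the levels
sizes >= "medium"
## [1] FALSE  TRUE  TRUE FALSE

# max() returns the highest level present
max(sizes)
## [1] big
## Levels: small < medium < big
```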

Replacement

Unlike vectors, replacing elements within a factor isn’t so easy. While replacing one element with an established level is possible, replacing with a value that is not an existing level will result in a warning and an NA.

factorExample <- factor(c("small","big","big","small"))
factorExample[1] <- c("big")
factorExample
## [1] big   big   big   small
## Levels: big small
factorExample[1] <- c("huge")
## Warning in `[<-.factor`(`*tmp*`, 1, value = "huge"): invalid factor level, NA
## generated
factorExample
## [1] <NA>  big   big   small
## Levels: big small

Replacement

To add a new level we can use the levels() function.

levels(factorExample) <- c("big","small","huge")
factorExample[1] <- c("huge")
factorExample
## [1] huge  big   big   small
## Levels: big small huge

Data Frames


All programming languages have a concept of a table. In R, the most useful type is a data frame.

Creating data frames

In R, we make use of the data frame object which allows us to store tables with columns of different data types. To create a data frame we can simply use the data.frame() function.

patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName, Type=patientType, Survival_Time=survivalTime)
dfExample
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
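
To confirm that each column keeps its own data type, the class of every column can be checked. The sketch below builds a smaller, hypothetical data frame named patients and sets stringsAsFactors = FALSE explicitly, so text columns stay as character regardless of the R version.

```r
patients <- data.frame(Name = c("patient1", "patient2"),
                       Type = factor(c("male", "female")),
                       Survival_Time = c(1, 30),
                       stringsAsFactors = FALSE)

# Apply class() to each column; the result is a named character vector
colClasses <- sapply(patients, class)
colClasses
```

The str() function gives a similar overview in a more compact form.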

Indexing

Selecting and replacing portions of a data frame can be done by indexing using square brackets [] much like for vectors.

When indexing data frames, two values may be provided within the square brackets separated by a comma to retrieve information on a data frame position.

The first value(s) corresponds to row(s) and the second to column(s).

  • myDataFrame[rowOfInterest,columnOfInterest]

Indexing

dfExample
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20

Value of first column, second row

dfExample[2,1]
## [1] "patient2"

Indexing

Similarly, whole rows or columns can be extracted. A single column is returned as a vector, while a single row is returned as a one-row data frame.

Values of second column (row index is empty!)

dfExample[,2]
## [1] male   female male   female
## Levels: female male

Values of third row (column index is empty!)

dfExample[3,]
##       Name Type Survival_Time
## 3 patient3 male             2

Indexing

When multiple column or row indices are specified, a new data frame is returned.

Values of second and third row (column index is empty!)

dfExample[c(2,3),]
##       Name   Type Survival_Time
## 2 patient2 female            30
## 3 patient3   male             2
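
Columns can also be selected by name instead of position, which is often easier to read. A short sketch on a hypothetical data frame named patients:

```r
patients <- data.frame(Name = c("patient1", "patient2", "patient3"),
                       Survival_Time = c(1, 30, 2),
                       stringsAsFactors = FALSE)

# One column by name (row index is empty)
patients[, "Survival_Time"]
## [1]  1 30  2

# One value by row position and column name
patients[2, "Name"]
## [1] "patient2"

# Several rows and several named columns return a data frame
patients[c(1, 3), c("Name", "Survival_Time")]
##       Name Survival_Time
## 1 patient1             1
## 3 patient3             2
```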

Indexing and replacement

Replacement can occur in the same way we have seen for less complex data types: use the assignment operator <-. Note that replacing a number with text converts the whole column to character.

dfExample[1,3] <- "Forever"
dfExample
##       Name   Type Survival_Time
## 1 patient1   male       Forever
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
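
A side effect of placing text into a numeric column is that every value in that column becomes text. The sketch below checks the column class before and after such a replacement, using a hypothetical data frame named scores.

```r
scores <- data.frame(id = c("a", "b"), value = c(1, 30),
                     stringsAsFactors = FALSE)
class(scores[, "value"])
## [1] "numeric"

# Replacing one number with text coerces the whole column to character
scores[1, "value"] <- "Forever"
class(scores[, "value"])
## [1] "character"
scores[, "value"]
## [1] "Forever" "30"
```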

Indexing and replacement

When we work with factors, for a replacement to be successful it has to be a possible level.

dfExample[1,"Type"] <- "other"
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "other"): invalid factor level,
## NA generated
dfExample
##       Name   Type Survival_Time
## 1 patient1   <NA>       Forever
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20

Factors in data frames

It is possible to update factors in data frames just as with standard factors.

dfExample <- data.frame(Name=patientName,Type=patientType,
                        Survival_Time=survivalTime)

levels(dfExample[,"Type"]) <- c(levels(dfExample[,"Type"]) ,
                                "other")

dfExample[1,"Type"] <- "other"
dfExample
##       Name   Type Survival_Time
## 1 patient1  other             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20

Specify columns with $

It is also possible to index a data frame’s column by using the $ symbol.

dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
dfExample$Survival_Time
## [1]  1 30  2 20
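
Since $ returns the column as a vector, the result can be passed to functions or indexed further with square brackets. A minimal sketch, assuming a hypothetical data frame named trial:

```r
trial <- data.frame(Name = c("patient1", "patient2", "patient3", "patient4"),
                    Survival_Time = c(1, 30, 2, 20),
                    stringsAsFactors = FALSE)

# Use the column in a function
mean(trial$Survival_Time)
## [1] 13.25

# Replace a single value within the column
trial$Survival_Time[1] <- 5
trial$Survival_Time
## [1]  5 30  2 20
```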

Create columns with $

The $ operator also allows for the creation of new columns for a data frame on the fly.

dfExample
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
dfExample$newColumn <- rep("newData",nrow(dfExample))
dfExample
##       Name   Type Survival_Time newColumn
## 1 patient1   male             1   newData
## 2 patient2 female            30   newData
## 3 patient3   male             2   newData
## 4 patient4 female            20   newData

Finding dimensions

To find the dimensions of a data frame, the dim() function will return the number of rows followed by the number of columns, while nrow() and ncol() will return just the number of rows and the number of columns respectively.

dim(dfExample)
## [1] 4 4
nrow(dfExample)
## [1] 4
ncol(dfExample)
## [1] 4

Names

The functions colnames() and rownames() can be used to interact with the names. We can access them by simply using the function or update them using assignment.

colnames(dfExample)
## [1] "Name"          "Type"          "Survival_Time" "newColumn"
colnames(dfExample)[1] <- "PatientID"
dfExample
##   PatientID   Type Survival_Time newColumn
## 1  patient1   male             1   newData
## 2  patient2 female            30   newData
## 3  patient3   male             2   newData
## 4  patient4 female            20   newData
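
Row names can be set in the same way and then used as row indices. The sketch below is an illustration on a hypothetical data frame named trial.

```r
trial <- data.frame(Survival_Time = c(1, 30, 2))

# Assign row names, then index rows by name
rownames(trial) <- c("patient1", "patient2", "patient3")
trial["patient2", "Survival_Time"]
## [1] 30
```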

Joining vectors and data frames

A data frame can be created from multiple vectors or other data frames.

cbind() can be used to attach data to a data frame as columns.

x <- 1:4
newDF <- cbind(x,dfExample)
newDF
##   x PatientID   Type Survival_Time newColumn
## 1 1  patient1   male             1   newData
## 2 2  patient2 female            30   newData
## 3 3  patient3   male             2   newData
## 4 4  patient4 female            20   newData

Joining vectors and data frames

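rbind() can be used to attach data to a data frame as rows. Note that z below holds only four values for a data frame with five columns, so R recycles its first value (5) to fill newColumn and issues a warning.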

z <- c(5,"patient5","male",25)
newerDF <- rbind(newDF,z)
## Warning in rbind(deparse.level, ...): number of columns of result, 5, is not a
## multiple of vector length 4 of arg 2
newerDF
##   x PatientID   Type Survival_Time newColumn
## 1 1  patient1   male             1   newData
## 2 2  patient2 female            30   newData
## 3 3  patient3   male             2   newData
## 4 4  patient4 female            20   newData
## 5 5  patient5   male            25         5
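
One way to avoid the recycling warning and keep column types intact is to bind a one-row data frame with matching column names rather than a plain vector. This is a sketch of that alternative, not the approach used in the slide above; the names cohort and newRow are hypothetical.

```r
cohort <- data.frame(PatientID = c("patient1", "patient2"),
                     Survival_Time = c(1, 30),
                     stringsAsFactors = FALSE)

# The new row has the same column names and types as cohort
newRow <- data.frame(PatientID = "patient3", Survival_Time = 25,
                     stringsAsFactors = FALSE)

cohort <- rbind(cohort, newRow)
cohort
##   PatientID Survival_Time
## 1  patient1             1
## 2  patient2            30
## 3  patient3            25

# The numeric column stays numeric
class(cohort$Survival_Time)
## [1] "numeric"
```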

Lists


Creating lists

Lists are the final data type we will look at.

In R, lists provide a general container which can hold any data type.

firstElement <- 10
secondElement <- c("a","b","c","d")
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))

Creating lists

To create a list we can simply use the list() function with arguments specifying the data we wish to include in the list.

myList <- list(firstElement, secondElement, thirdElement)
myList
## [[1]]
## [1] 10
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
##   colOne colTwo
## 1      1    One
## 2      2    Two
## 3      4  Three
## 4      5   Four

Indexing

Lists, as with other data types in R, can be indexed. In contrast to other types, using [] on a list will subset the list to another list of the selected indices. To retrieve an element from a list in R, double square brackets [[]] must be used.

myList <- list(firstElement,secondElement,thirdElement)
myList[1]
## [[1]]
## [1] 10
myList[[1]]
## [1] 10
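
The difference between the two kinds of brackets is easiest to see by checking the class of the result. A small sketch with a hypothetical list named demoList:

```r
demoList <- list(10, c("a", "b", "c", "d"))

# Single brackets return a (shorter) list
class(demoList[2])
## [1] "list"

# Double brackets return the element itself
class(demoList[[2]])
## [1] "character"

# The extracted element can then be indexed like any vector
demoList[[2]][3]
## [1] "c"
```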

Joining lists

Again, similar to vectors, lists can be joined together in R using the c() function.

myNamedList <- list(First=firstElement,Second=secondElement,
                    Third=thirdElement)
myNamedList <- c(myNamedList,list(fourth=c(4,4)))
myNamedList[c(1,4)]
## $First
## [1] 10
## 
## $fourth
## [1] 4 4
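
Elements of a named list can also be retrieved by name, either with the $ symbol or with a name inside double square brackets. A minimal sketch on a hypothetical list named namedExample:

```r
namedExample <- list(First = 10, Second = c("a", "b"))

names(namedExample)
## [1] "First"  "Second"

# Retrieve an element by name with $
namedExample$First
## [1] 10

# Or with a name inside double square brackets
namedExample[["Second"]]
## [1] "a" "b"
```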

Flattening lists

Sometimes you will wish to “flatten” out a list. When a list contains compatible objects, i.e. a list whose elements are all of one type, the unlist() function can be used.

myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
myNamedList
## $First
## [1] 1 2 3
## 
## $Second
## [1] 2 6 7
## 
## $Third
## [1] 1 4 7
flatList <- unlist(myNamedList)
flatList[1:7]
##  First1  First2  First3 Second1 Second2 Second3  Third1 
##       1       2       3       2       6       7       1
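
When the elements are not all of one type, unlist() still returns a single vector but converts every value to a common type. The sketch below, using a hypothetical list named mixedList, shows numbers and logical values being turned into text.

```r
mixedList <- list(a = 1, b = "x", c = TRUE)
flatMixed <- unlist(mixedList)

# All values have been converted to character
class(flatMixed)
## [1] "character"

# The list names are kept as vector names
names(flatMixed)
## [1] "a" "b" "c"
```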

Time for an exercise!

Exercises on what we have covered can be found here

Answers to exercise

Answers can be found here