Introduction to R - Session 1

.title[
# Introduction to R - Session 1
]
.subtitle[
## Bioinformatics Resource Center - Rockefeller University
]
.author[
### <a href="http://rockefelleruniversity.github.io/Intro_To_R_1Day/" class="uri">http://rockefelleruniversity.github.io/Intro_To_R_1Day/</a>
]
.author[
### <a href="mailto:brc@rockefeller.edu" class="email">brc@rockefeller.edu</a>
]

---

## Overview

- [Course Home Page](http://rockefelleruniversity.github.io/Intro_To_R_1Day/)
- [Set up](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#set-up)
- [Background to R](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#background-to-r)
- [Data types in R](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#data_types_in_r)
- [Reading and writing in R](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#reading-and-writing-data-in-r)

---
class: inverse, center, middle

# Set Up
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> 
---

## Materials

All prerequisites, links to material and slides for this course can be found on GitHub.

* [Intro_To_R_1](https://rockefelleruniversity.github.io/Intro_To_R_1Day/)

Alternatively, they can be downloaded as a zip archive from here.

* [Download zip](https://github.com/rockefelleruniversity/Intro_To_R_1Day/zipball/master)

---
## Course materials

Once the zip file is unarchived, all presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

* **r_course/presentations/slides/**
Presentations as an HTML slide show.
* **r_course/presentations/singlepage/** 
Presentations as an HTML single page.
* **r_course/presentations/r_code/**
R code in presentations.
* **r_course/exercises/**
Practicals as HTML pages. 
* **r_course/answers/**
Practicals with answers as HTML pages and R code solutions.

---
class: inverse, center, middle

# Background of R

---

## What is R?

**R** is a scripting language and environment for **statistical computing**.

Developed by [Robert Gentleman](https://en.wikipedia.org/wiki/Robert_Gentleman_%28statistician%29) and [Ross Ihaka](https://en.wikipedia.org/wiki/Ross_Ihaka).

Inheriting much from **S** (Bell Labs).

- Suited to high level data analysis
- Open source & cross platform
- Extensive graphics capabilities
- Diverse range of add-on packages
- Active community of developers
- Thorough documentation

---
## What is R to you?

.pull-left[
**R** comes with excellent "out-of-the-box" statistical and plotting capabilities.

**R** provides access to thousands of packages ([CRAN](http://cran.r-project.org/)/[MRAN](http://mran.revolutionanalytics.com/)/[R-forge](https://r-forge.r-project.org/)) which extend the basic functionality of R while maintaining high-quality documentation.

In particular, [Robert Gentleman](https://en.wikipedia.org/wiki/Robert_Gentleman_%28statistician%29) developed the **[Bioconductor](http://bioconductor.org/)** project, where hundreds of packages are directly related to computational biology and the analysis of associated high-throughput experiments.
  ] 
.pull-right[

R package downloads from Bioconductor

![R use over time](imgs/bioconductor_packages.png)

]

---
## How to get R?

.pull-left[
Freely available from [R-project website](http://cran.ma.imperial.ac.uk/).

RStudio provides an integrated development environment (IDE) which is freely available from the [RStudio site](http://www.rstudio.com/).

***We will be using RStudio and R already installed on your machines.***
  ]
.pull-right[
![R website](imgs/cran.jpeg)
![RStudio website](imgs/rstudio.jpeg)
  ]

---
## A quick tour of RStudio

.pull-left[
Four main panels
- Scripting panel
- R interface
- Environment and history
- Files, directories and help

**Let's load RStudio and take a look**
  ]

.pull-right[

![RStudio appearance](imgs/rstudioBlank.jpeg)

]

---

# Data Types in R

---

## Different Data types in R

- Simple calculations
- Variables
- Vectors
- Lists
- Matrices
- Data frames

---
## Simple Calculations

At its most basic, **R** can be used as a simple calculator.

``` r
3+1
```

```
## [1] 4
```

``` r
2*2
```

```
## [1] 4
```

``` r
sqrt(25)-1
```

```
## [1] 4
```

---
## Using functions

The **sqrt(25)** call demonstrates the use of functions in R. A function performs a complex operation on its arguments and returns the result.

In R, arguments are provided to a function within the parentheses -- **( )** -- that follow the function name. So **sqrt(*ARGUMENT*)** will provide the square root of the value of ***ARGUMENT***.

Other examples of functions include **min()**, **sum()**, **max()**.

Note that multiple arguments are separated by commas.

``` r
min(2,4,6)
```

```
## [1] 2
```

``` r
sum(2,4,6)
```

```
## [1] 12
```

``` r
max(2,4,6)
```

```
## [1] 6
```
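
---
## Using functions

Function calls can also be nested, so that the result of one call becomes an argument to another. A minimal sketch using only the functions introduced so far; the variable names `innerTotal` and `nestedResult` are illustrative and not used elsewhere in the course.

``` r
# The inner call sum(2,4,6) is evaluated first and gives 12
innerTotal <- sum(2,4,6)
# Its result is then used inside the outer call: sqrt(12 + 4)
nestedResult <- sqrt(sum(2,4,6) + 4)
nestedResult
```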

---
## Using functions

R has many useful functions "built in" and ready to use as soon as R is loaded.

An incomplete, illustrative list can be seen [here](http://www.statmethods.net/management/functions.html)

In addition to R standard functions, additional functionality can be loaded into R using libraries. These include specialised tools for areas such as sequence alignment, read counting etc.

If you need to see how a function works try **?** in front of the function name.

``` r
?sqrt
```

Let's run [**?sqrt**](https://stat.ethz.ch/R-manual/R-devel/library/base/html/MathFun.html) in RStudio and look at the help.

---
## Using functions
**Arguments have names and order**

With functions such as min() and sqrt(), the arguments to be provided are obvious and the order of these arguments doesn't matter.

``` r
min(5,4,6)
```

```
## [1] 4
```

``` r
min(6,4,5)
```

```
## [1] 4
```

Many functions however have an order to their arguments.
Try and look at the arguments for the dir() function using [?dir](https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.files.html).

``` r
?dir
```

---
## Using functions
**Setting names for arguments**

Often we know the names of arguments but not necessarily their order.
In cases where we want to be sure we specify the right argument, we provide names for the arguments used.

``` r
dir()
dir(full.names=T)
```

This also means we don't have to copy out all the defaults for the arguments preceding it.

``` r
dir(full.names=T)
# Is equivalent to...
dir(".",NULL,FALSE,T)
```
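
---
## Using functions

The same idea can be sketched with **seq()**, a base R function covered in the vectors section, whose first three arguments are **from**, **to** and **by**. The variable names `byPosition` and `byName` are illustrative only.

``` r
# Without names, arguments are matched by position: from, to, by
byPosition <- seq(1, 10, 3)
# With names, the arguments can be supplied in any order
byName <- seq(by=3, to=10, from=1)
byPosition
# Both calls produce the same vector
identical(byPosition, byName)
```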

---
class: inverse, center, middle

# Variables

---

## Variables

As with other programming languages and even graphical calculators, **R** makes use of **variables**.

A **variable** stores a value as a letter or word.

In **R**, we make use of the assignment operator **<-**

``` r
x <- 10
```
Now **x** holds the value of 10

``` r
x
```

```
## [1] 10
```

---
## Altering variables

``` r
x
```

```
## [1] 10
```

Variables can be altered in place

``` r
x <- 20
x
```

```
## [1] 20
```

---

## Using variables

Variables can be used just as the values they contain.

``` r
x
```

```
## [1] 20
```

``` r
x + sqrt(25)
```

```
## [1] 25
```
Variables can be used to create new variables

``` r
y <- x + sqrt(25)
y
```

```
## [1] 25
```

---
class: inverse, center, middle

# Vectors

---

## Vectors

In **R** the most basic variable or data type is a **vector**. A vector is an ordered collection of values. The x and y variables we have previously assigned are examples of a vector of length 1.

``` r
x
```

```
## [1] 20
```

``` r
length(x)
```

```
## [1] 1
```

---
## Vectors

To create a multiple value vector we use the function **c()** to *combine* the supplied arguments into one vector.

``` r
x <- c(1,2,3,4,5,6,7,8,9,10)
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

``` r
length(x)
```

```
## [1] 10
```

---
## Vectors

Vectors of continuous stretches of values can be created using a colon (**:**) as a shortcut.

``` r
y <- 6:10
y
```

```
## [1]  6  7  8  9 10
```

Other useful functions for creating stretches of numeric values are **seq()** and **rep()**.
The **seq()** function creates a sequence of numeric values from a specified start and end value, incrementing by a user-defined amount. The **rep()** function repeats a value or vector a user-defined number of times.

``` r
seq(from=1,to=5,by=2)
```

```
## [1] 1 3 5
```

``` r
rep(c(1,5,10),3)
```

```
## [1]  1  5 10  1  5 10  1  5 10
```
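
---
## Vectors

Both functions take further arguments that change how the values are generated. A short sketch, assuming the base R arguments **each** (for **rep()**) and **length.out** (for **seq()**); the variable names are illustrative.

``` r
# each= repeats every element in turn, rather than the whole vector
repeatedEach <- rep(c(1,5,10), each=2)
repeatedEach      # 1 1 5 5 10 10

# length.out= fixes the number of values instead of the step size
evenlySpaced <- seq(from=0, to=1, length.out=5)
evenlySpaced      # 0.00 0.25 0.50 0.75 1.00
```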

---
## Indexing

Square brackets **[]** identify the position within a vector (the **index**).
These indices can be used to extract relevant values from vectors.

``` r
z <- seq(from=2,to=20,by=2) 
z
```

```
##  [1]  2  4  6  8 10 12 14 16 18 20
```

``` r
z[1]
```

```
## [1] 2
```

``` r
z[8]
```

```
## [1] 16
```

---
## Indexing

Indices can be used to extract values from multiple positions within a vector.

``` r
z[c(1,6)]
```

```
## [1]  2 12
```
Negative indices can be used to extract all positions except that specified.

``` r
z[-5]
```

```
## [1]  2  4  6  8 12 14 16 18 20
```
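
---
## Indexing

Index vectors can themselves be built with the shortcuts seen earlier. A minimal sketch using a fresh vector, here called `evens` for illustration so that **z** is left untouched.

``` r
evens <- seq(from=2, to=20, by=2)
# A colon range extracts a continuous stretch of positions
evens[2:4]              # 4 6 8
# A vector of negative indices drops several positions at once
evens[-c(1,2)]
# length() gives the position of the last value
evens[length(evens)]    # 20
```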

---
## Indexing and replacement

We can use indices to modify a specific position in a vector.

``` r
z
```

```
##  [1]  2  4  6  8 10 12 14 16 18 20
```

``` r
z[5] <- 1000
z
```

```
##  [1]    2    4    6    8 1000   12   14   16   18   20
```

---
## Indexing and replacement

Indices can be specified using other vectors.

``` r
y
```

```
## [1]  6  7  8  9 10
```

``` r
z[y] <- 0
z
```

```
##  [1]    2    4    6    8 1000    0    0    0    0    0
```

---
## Remember!

Square brackets **[]**  for indexing.

``` r
x[1]
```

```
## [1] 1
```

Parentheses **()**  for function arguments.

``` r
sqrt(4)
```

```
## [1] 2
```

---
## Arithmetic operations

Vectors in R can be used in arithmetic operations as seen with [variables earlier](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Using_variables). When a standard arithmetic operation is applied to a vector, the operation is applied to each position in the vector.

``` r
x <- c(1,2,3,4,5,6,7,8,9,10)
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

``` r
y <- x*2
y
```

```
##  [1]  2  4  6  8 10 12 14 16 18 20
```

Multiple vectors can be used within arithmetic operations.

``` r
x+y
```

```
##  [1]  3  6  9 12 15 18 21 24 27 30
```

---
## Arithmetic operations

When applying an arithmetic operation between two vectors of unequal length, the shorter will be recycled.

``` r
x <- c(1,2,3,4,5,6,7,8,9,10)
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

``` r
x+c(1,2)
```

```
##  [1]  2  4  4  6  6  8  8 10 10 12
```

``` r
x+c(1,2,3)
```

```
## Warning in x + c(1, 2, 3): longer object length is not a multiple of shorter
## object length
```

```
##  [1]  2  4  6  5  7  9  8 10 12 11
```
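
---
## Arithmetic operations

One way to picture recycling is that the shorter vector is repeated until it matches the length of the longer one. The sketch below makes that repetition explicit with **rep()**; the variable names are illustrative.

``` r
values <- 1:10
# Recycling happens automatically here
recycled <- values + c(1,2)
# The same result, with the repetition written out by hand
explicit <- values + rep(c(1,2), length.out=length(values))
identical(recycled, explicit)
```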

---
## R and messages

When R gives you an unexpected response DON'T PANIC.

You will run into errors when coding.

Most of the time the messages are very clear. If they are not, a quick Google search will often clear things up:

```
Warning in x + c(1, 2, 3) :
  longer object length is not a multiple of shorter object length
```

---
## Messages, Warnings, Errors

There are several kinds of messages.

* Message - Just a text message - This can denote progress or options used.

* Warning - Just a text message starting with **"Warning"** - Often denotes if you have done something atypical. You should review the message unless you did it on purpose.

* Error - The code will stop and a text message starting with **"Error"** - Something broke the code. Review the message.
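
---
## Messages, Warnings, Errors

The three kinds can be sketched as follows. This example uses **tryCatch()** and **conditionMessage()**, base R functions not otherwise covered in this session, only so that the warning and error text can be stored and printed rather than interrupting the code. The variable names are illustrative.

``` r
# Message: informational text only, the code carries on
message("Starting analysis")

# Warning: converting text that is not a number is atypical
warningText <- tryCatch(as.numeric("three"),
                        warning=function(w) conditionMessage(w))
warningText

# Error: the square root of text cannot be computed, so the code stops
errorText <- tryCatch(sqrt("four"),
                      error=function(e) conditionMessage(e))
errorText
```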

---
## Character vectors

So far we have only looked at numeric vectors or variables.

In R we can also create character vectors [again using **c()** function](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#vectors22). These vectors can be indexed just the same.

``` r
y <- c("ICTEM","CommonWealth","Wolfson")
y[2]
```

```
## [1] "CommonWealth"
```

Character vectors can be used to assign names to other vectors.

``` r
x <- c(1:3)
names(x) <- y
x
```

```
##        ICTEM CommonWealth      Wolfson 
##            1            2            3
```

---
## Character vectors as names

These named vectors may be indexed by a position's "name".

``` r
x[c("ICTEM","Wolfson")]
```

```
##   ICTEM Wolfson 
##       1       3
```
  
Index names missing from vectors will return the special value "NA".

``` r
x[c("Strand")]
```

```
## <NA> 
##   NA
```

---
## A note on NA values

In R, like many languages, when a value in a variable is missing, the value is assigned an **NA** value.

Similarly, when a calculation cannot be performed, R will return a **NaN** value.

- **NA** - Not Available.
- **NaN** - Not A Number.

**NA** values allow for R to handle missing data correctly as they require different handling than standard numeric or character values. We will illustrate an example handling **NA** values [later](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Ordering_with_NA_values).
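
---
## A note on NA values

A brief sketch of how these values behave, using the base R functions **is.na()** and **is.nan()** and the **na.rm** argument of **mean()**. The variable names are illustrative.

``` r
measurements <- c(1, NA, 3)
is.na(measurements)               # FALSE TRUE FALSE
mean(measurements)                # NA: the missing value propagates
mean(measurements, na.rm=TRUE)    # 2: NA values are removed first

notANumber <- 0/0                 # a calculation with no defined result
is.nan(notANumber)                # TRUE
```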

---
## The unique() function

The unique() function can be used to retrieve all unique values from a vector.

``` r
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
unique(geneList)
```

```
## [1] "Gene1" "Gene2" "Gene3" "Gene4" "Gene5"
```

---
## Logical vectors

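Logical vectors are a class of vector made up of TRUE or FALSE boolean values. The single letters T and F can also be used, although they are ordinary variables that can be overwritten, so TRUE and FALSE are the safer choice.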

``` r
z <-  c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE) 
# or
z <- c(T,F,T,F,T,F,T,F,T,F)

z
```

```
##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
```
Logical vectors can be used like an index to specify positions in a vector. TRUE values will return the corresponding position in the vector being indexed.

``` r
x <- 1:10
x[z]
```

```
## [1] 1 3 5 7 9
```

---
## The %in% operator

A common task in R is to subset one vector by the values in another vector.

The **%in%** operator in the context **A %in% B** creates a logical vector indicating whether each value in **A** matches any value in **B**.

This can then be used to subset the values within one character vector by those in a second.

``` r
geneList <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
secondGeneList <- c("Gene5","Gene3")
logical_index <- geneList %in% secondGeneList
logical_index
```

```
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

``` r
geneList[logical_index]
```

```
## [1] "Gene3" "Gene5" "Gene3"
```
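
---
## The %in% operator

The opposite selection, values that are not in the second vector, can be sketched by inverting the logical vector with **!**. The base R function **intersect()** gives the unique shared values directly. The variable names are illustrative.

``` r
allGenes <- c("Gene1","Gene2","Gene3","Gene4","Gene5","Gene1","Gene3")
genesOfInterest <- c("Gene5","Gene3")

# ! flips TRUE and FALSE, keeping values NOT found in genesOfInterest
notOfInterest <- allGenes[!allGenes %in% genesOfInterest]
notOfInterest

# intersect() returns each shared value once
sharedGenes <- intersect(allGenes, genesOfInterest)
sharedGenes
```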

---
## The *grepl()* function

Another more flexible way of making a logical vector for indexing is by using the grepl function. This uses *regular expressions* so you can do more advanced pattern matching.

We simply provide the pattern we are looking for and the vector we are looking in.

``` r
mixedList <- c("protein1","Gene1","Protein2","Gene3","Gene4","Protein4","Gene5","Gene1","Protein5")
logical_index <- grepl("Prot", mixedList)
logical_index
```

```
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
```

``` r
mixedList[logical_index]
```

```
## [1] "Protein2" "Protein4" "Protein5"
```

---
## The *grepl()* function

We can also control case sensitivity with the *ignore.case* argument.

``` r
logical_index <- grepl("Prot", mixedList, ignore.case=T)
logical_index
```

```
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
```

``` r
mixedList[logical_index]
```

```
## [1] "protein1" "Protein2" "Protein4" "Protein5"
```

There is also a related *grep()* function which returns the numeric indices at which a pattern was matched.

``` r
grep("Prot", mixedList)
```

```
## [1] 3 6 9
```
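
---
## The *grepl()* function

As a small taste of regular expressions, the sketch below uses the anchors **^** (start of the value) and **$** (end of the value). The vector `featureNames` is illustrative and separate from **mixedList**.

``` r
featureNames <- c("protein1","Gene1","Protein2","Gene3","Gene10")

# ^Gene matches only values that begin with "Gene"
startsWithGene <- grepl("^Gene", featureNames)
startsWithGene

# 1$ matches only values that end with "1", so "Gene10" is excluded
endsWithOne <- featureNames[grepl("1$", featureNames)]
endsWithOne
```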

---
## Logical vectors from operators

Vectors may also be directly evaluated to produce logical vectors. This can be very useful when using a logical to index.

Common examples are:

- **==**  evaluates as equal.
- **>** and **<** evaluates as greater or less than respectively.
- **>=** and **<=** evaluates as greater than or equal or less than or equal respectively.

``` r
x <- 1:10
x > 5
```

```
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
```

``` r
x[x > 5]
```

```
## [1]  6  7  8  9 10
```

---
## Combining logical vectors

Logical vectors can be used in combination in order to index vectors. To combine logical vectors we can use some common R operators.

- **&** - Requires both logical values to be TRUE.
- **|** - Requires either logical value to be TRUE.
- **!** - Reverses the logical value, so TRUE becomes FALSE and FALSE becomes TRUE.

``` r
x <- 1:10
!x > 4
```

```
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

``` r
x > 4 & x < 7
```

```
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
```

``` r
x > 4 | x < 7
```

```
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
```

---
## Combining logical vectors

Such combinations can allow for complex selection of a vector's values.

``` r
x <- 1:10
x
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

``` r
x[x > 4 & x < 7]
```

```
## [1] 5 6
```

``` r
x[x > 4 & !x < 7]
```

```
## [1]  7  8  9 10
```
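
---
## Combining logical vectors

Logical vectors can also be summarised rather than used as an index. A short sketch with the base R functions **sum()**, **which()**, **any()** and **all()**; the variable names are illustrative.

``` r
scores <- 1:10
inRange <- scores > 4 & scores < 7

sum(inRange)        # TRUE counts as 1, so this is the number of matches: 2
which(inRange)      # positions of the TRUE values: 5 6
any(scores > 10)    # FALSE: no value is greater than 10
all(scores > 0)     # TRUE: every value is greater than 0
```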

---
## Time for an exercise!

Exercise on vectors can be found [here](../../exercises/exercises/Vectors_exercise.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/Vectors_answers.html)

---

# Matrices

---

## Creating matrices

In programs such as Excel we are used to tables.

---
## Creating matrices

All programming languages have a concept of a table. In **R**, the most basic table type is a **matrix**.

A **matrix** can be created using the ***matrix()*** function with the arguments of **nrow** and **ncol** specifying the number of rows and columns respectively.

``` r
narrowMatrix <- matrix(1:10, nrow=5, ncol=2)
narrowMatrix
```

```
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
```

``` r
wideMatrix <- matrix(1:10, nrow=2, ncol=5)
wideMatrix
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

---
## Creating matrices

By default when creating a matrix using the **matrix** function, the values fill the matrix by columns. To fill a matrix by rows the **byrow** argument must be set to TRUE.

``` r
wideMatrix <- matrix(1:10, nrow=2, ncol=5)
wideMatrix
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

``` r
wideMatrixByRow <- matrix(1:10, nrow=2, ncol=5, byrow=TRUE)
wideMatrixByRow
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
```
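
---
## Creating matrices

Filling by rows is closely related to transposing. As a sketch, the base R function **t()** swaps rows and columns, so a 5 by 2 matrix filled by columns becomes the same 2 by 5 matrix as one filled by rows. The variable names are illustrative.

``` r
filledByRow <- matrix(1:10, nrow=2, ncol=5, byrow=TRUE)
# Fill a 5 x 2 matrix by columns (the default), then transpose it
transposed <- t(matrix(1:10, nrow=5, ncol=2))
transposed
identical(filledByRow, transposed)
```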

---
## Finding dimensions

To find dimensions of a matrix, the **dim()** function will provide dimensions as the row then column number while **nrow()** and **ncol()** will return just row number and column number respectively.

``` r
dim(narrowMatrix)
```

```
## [1] 5 2
```

``` r
nrow(narrowMatrix)
```

```
## [1] 5
```

``` r
ncol(narrowMatrix)
```

```
## [1] 2
```

---
## Joining vectors and matrices

A matrix can be created from multiple vectors or other matrices.

**cbind()** can be used to attach data to a matrix as columns.

``` r
x <- 1:5
y <- 11:15
z <- 21:22
newMatrix <- cbind(x,y)
newMatrix
```

```
##      x  y
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
## [4,] 4 14
## [5,] 5 15
```

---
## Joining vectors and matrices

**rbind()** can be used to attach data to a matrix as rows.

``` r
newerMatrix <- rbind(newMatrix,z)
newerMatrix
```

```
##    x  y
##    1 11
##    2 12
##    3 13
##    4 14
##    5 15
## z 21 22
```

---
## Joining vectors and matrices

### Incompatible vectors and matrices

When creating a matrix using **cbind()** or **matrix()** from incompatible vectors, the shorter vector is recycled.

``` r
recycledMatrix2 <- matrix(1:5,ncol=2,nrow=3)
```

```
## Warning in matrix(1:5, ncol = 2, nrow = 3): data length [5] is not a
## sub-multiple or multiple of the number of rows [3]
```

``` r
recycledMatrix2
```

```
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    1
```

---
## Joining vectors and matrices

### Incompatible vectors and matrices

For the **rbind()** function, the longer vector is clipped.

``` r
recycledMatrix3 <- rbind(recycledMatrix2,c(1:5))
```

```
## Warning in rbind(...): number of columns of result is not a multiple of vector
## length (arg 2)
```

``` r
recycledMatrix3
```

```
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    1
## [4,]    1    2
```

---
## Column and row names

[As we have seen with vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Character_vectors), matrices can be named. For matrices the naming is done by columns and rows using **colnames()** and **rownames()** functions.

``` r
namedMatrix <- matrix(1:10,ncol=5,nrow=2)
colnames(namedMatrix) <- paste("Column",1:5,sep="_")
rownames(namedMatrix) <- paste("Row",1:2,sep="_")
namedMatrix
```

```
##       Column_1 Column_2 Column_3 Column_4 Column_5
## Row_1        1        3        5        7        9
## Row_2        2        4        6        8       10
```
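
---
## Column and row names

Names can also be supplied when the matrix is created. A sketch using the **dimnames** argument of **matrix()**, which takes a list of row names followed by column names; `labelledMatrix` is an illustrative name.

``` r
labelledMatrix <- matrix(1:10, nrow=2, ncol=5,
                         dimnames=list(paste("Row",1:2,sep="_"),
                                       paste("Column",1:5,sep="_")))
labelledMatrix
```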

---
## Column and row names

Information on matrix names can also be retrieved using the same functions.

``` r
colnames(namedMatrix)
```

```
## [1] "Column_1" "Column_2" "Column_3" "Column_4" "Column_5"
```

``` r
rownames(namedMatrix)
```

```
## [1] "Row_1" "Row_2"
```

---
## Indexing

Selecting and replacing portions of a matrix can be done by **indexing** using square brackets **[]** much [like for vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Indexing).

When indexing matrices, two values may be provided within the square brackets separated by a comma to retrieve information on a matrix position.

The first value(s) corresponds to row(s) and the second to column(s).

- ***myMatrix[rowOfInterest,columnOfInterest]***

---
## Indexing

``` r
narrowMatrix
```

```
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
```
Value of first column, second row

``` r
narrowMatrix[2,1]
```

```
## [1] 2
```

---
## Indexing

Similarly, whole rows or columns can be extracted. Single rows and columns will return a vector.

Values of second column (row index is empty!)

``` r
narrowMatrix[,2]
```

```
## [1]  6  7  8  9 10
```

Values of third row (column index is empty!)

``` r
narrowMatrix[3,]
```

```
## [1] 3 8
```
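
---
## Indexing

If a single row or column should stay a matrix, the **drop** argument of the square brackets can be set to FALSE. A minimal sketch on a fresh matrix, here called `exampleMatrix` for illustration.

``` r
exampleMatrix <- matrix(1:10, nrow=5, ncol=2)

# By default a single row is simplified to a vector
singleRow <- exampleMatrix[3,]
is.matrix(singleRow)                  # FALSE

# drop=FALSE keeps the result as a 1 x 2 matrix
keptAsMatrix <- exampleMatrix[3, , drop=FALSE]
dim(keptAsMatrix)                     # 1 2
```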

---
## Indexing

When multiple columns or row indices are specified, a matrix is returned.

Values of second and third row (column index is empty!)

``` r
narrowMatrix[c(2,3),]
```

```
##      [,1] [,2]
## [1,]    2    7
## [2,]    3    8
```

---
## Indexing by name

As with vectors, names can be used for indexing when present

``` r
colnames(narrowMatrix) <- paste("Column",1:2,sep="_")
rownames(narrowMatrix) <- paste("Row",1:5,sep="_")
narrowMatrix[,"Column_1"]
```

```
## Row_1 Row_2 Row_3 Row_4 Row_5 
##     1     2     3     4     5
```

``` r
narrowMatrix["Row_1",]
```

```
## Column_1 Column_2 
##        1        6
```

``` r
narrowMatrix["Row_1","Column_1"]
```

```
## [1] 1
```

---
## Advanced indexing

As with vectors, matrices can be subset by logical vectors

``` r
narrowMatrix
```

```
##       Column_1 Column_2
## Row_1        1        6
## Row_2        2        7
## Row_3        3        8
## Row_4        4        9
## Row_5        5       10
```

``` r
narrowMatrix[,1]
```

```
## Row_1 Row_2 Row_3 Row_4 Row_5 
##     1     2     3     4     5
```

``` r
narrowMatrix[,1] < 5
```

```
## Row_1 Row_2 Row_3 Row_4 Row_5 
##  TRUE  TRUE  TRUE  TRUE FALSE
```

---
## Advanced indexing

``` r
narrowMatrix[narrowMatrix[,1] < 5,]
```

```
##       Column_1 Column_2
## Row_1        1        6
## Row_2        2        7
## Row_3        3        8
## Row_4        4        9
```

---
## Arithmetic operations

As with vectors, matrices can have arithmetic operations applied to cells, rows, columns or the whole matrix

``` r
narrowMatrix
```

```
##       Column_1 Column_2
## Row_1        1        6
## Row_2        2        7
## Row_3        3        8
## Row_4        4        9
## Row_5        5       10
```

``` r
narrowMatrix[1,1]+2
```

```
## [1] 3
```

``` r
narrowMatrix[1,]+2
```

```
## Column_1 Column_2 
##        3        8
```

---
## Arithmetic operations

``` r
mean(narrowMatrix)
```

```
## [1] 5.5
```
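
---
## Arithmetic operations

Summaries per row or per column are often more useful than one value for the whole matrix. A sketch using the base R functions **colMeans()** and **rowSums()** on a fresh matrix with an illustrative name.

``` r
exampleMatrix <- matrix(1:10, nrow=5, ncol=2)

columnAverages <- colMeans(exampleMatrix)   # one mean per column: 3 8
columnAverages
rowTotals <- rowSums(exampleMatrix)         # one sum per row: 7 9 11 13 15
rowTotals
```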

---
## Replacement

As with vectors, matrices can have their elements replaced

``` r
narrowMatrix
```

```
##       Column_1 Column_2
## Row_1        1        6
## Row_2        2        7
## Row_3        3        8
## Row_4        4        9
## Row_5        5       10
```

``` r
narrowMatrix[1,1] <- 10
narrowMatrix[,2] <- 1
narrowMatrix
```

```
##       Column_1 Column_2
## Row_1       10        1
## Row_2        2        1
## Row_3        3        1
## Row_4        4        1
## Row_5        5        1
```

---
## Data types

Matrices must be all one type (e.g. numeric or character).

Here, replacing one value with a character string turns the numeric matrix into a character matrix.

``` r
narrowMatrix[,2] *2
```

```
## Row_1 Row_2 Row_3 Row_4 Row_5 
##     2     2     2     2     2
```

``` r
narrowMatrix[1,1] <- "Not_A_Number"
narrowMatrix
```

```
##       Column_1       Column_2
## Row_1 "Not_A_Number" "1"     
## Row_2 "2"            "1"     
## Row_3 "3"            "1"     
## Row_4 "4"            "1"     
## Row_5 "5"            "1"
```

---
## Data types

``` r
narrowMatrix[,2] *2
```

```
## Error in narrowMatrix[, 2] * 2: non-numeric argument to binary operator
```
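
---
## Data types

The type of a matrix can be checked, and values converted back, with base R functions such as **is.numeric()**, **is.character()** and **as.numeric()**. A sketch on a small illustrative matrix, separate from **narrowMatrix**.

``` r
mixedMatrix <- matrix(1:4, nrow=2, ncol=2)
is.numeric(mixedMatrix)              # TRUE

mixedMatrix[1,1] <- "Not_A_Number"
is.character(mixedMatrix)            # TRUE: every cell was converted

# Convert the second column back to numbers before doing arithmetic
doubled <- as.numeric(mixedMatrix[,2]) * 2
doubled                              # 6 8
```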

---
## Time for an exercise!

Exercise on matrices can be found [here](../../exercises/exercises/Matrices_exercise.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/Matrices_answers.html)

---
class: inverse, center, middle

# Factors

---

## Creating factors

A special case of a vector is a **factor**.

Factors are used to store data which may be grouped in categories (categorical data).
Specifying data as categorical allows R to properly handle the data and make use of functions specific to categorical data.

To create a factor from a vector we use the **factor()** function. Note that the factor now has an additional component called **"levels"** which identifies all categories within the vector.

``` r
vectorExample <- c("male","female","female","female")
factorExample <- factor(vectorExample)
factorExample
```

```
## [1] male   female female female
## Levels: female male
```

``` r
levels(factorExample)
```

```
## [1] "female" "male"
```
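
---
## Creating factors

Behind the scenes, a factor stores each value as an integer code pointing to one of its levels. A short sketch using **as.integer()**, **nlevels()** and **as.character()**; `sexFactor` is an illustrative name.

``` r
sexFactor <- factor(c("male","female","female","female"))

levels(sexFactor)         # "female" "male"
as.integer(sexFactor)     # codes into the levels: 2 1 1 1
nlevels(sexFactor)        # 2
as.character(sexFactor)   # back to the original character values
```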

---
## The summary() function

An example of the use of levels can be seen from applying the **summary()** function to the vector and factor examples

``` r
summary(vectorExample)
```

```
##    Length     Class      Mode 
##         4 character character
```

``` r
summary(factorExample)
```

```
## female   male 
##      3      1
```

---
## Display order of levels

In our factor example, the levels have been displayed in alphabetical order. To adjust the display order of levels in a factor, we can supply the desired display order to the **levels** argument in the **factor()** function call.

``` r
factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample
```

```
## [1] male   female female female
## Levels: male female
```

``` r
summary(factorExample)
```

```
##   male female 
##      1      3
```

---
## Nominal factors

In some cases there is no natural order to the categories such that one category is greater than the other (nominal data).

``` r
factorExample <- factor(vectorExample, levels=c("male","female"))
factorExample[1] < factorExample[2]
```

```
## Warning in Ops.factor(factorExample[1], factorExample[2]): '<' not meaningful
## for factors
```

```
## [1] NA
```

---
## Ordinal factors

In other cases there will be a natural ordering to the categories (ordinal data). A factor can be specified to be ordered using the **ordered** argument in combination with the **levels** argument.

``` r
factorExample <- factor(c("small","big","big","small"),
                        ordered=TRUE,levels=c("small","big"))
factorExample
```

```
## [1] small big   big   small
## Levels: small < big
```

``` r
factorExample[1] < factorExample[2]
```

```
## [1] TRUE
```
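
---
## Ordinal factors

Because the levels of an ordered factor have a defined order, functions such as **sort()** and **max()** follow that order rather than the alphabet. A sketch with an illustrative three-level factor.

``` r
sizes <- factor(c("big","small","medium","small"),
                ordered=TRUE, levels=c("small","medium","big"))

# Sorting follows the level order: small, small, medium, big
sortedSizes <- sort(sizes)
sortedSizes

# The largest category according to the level order
largestSize <- as.character(max(sizes))
largestSize
```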

---
## Replacement

Unlike vectors, replacing elements within a factor isn't so easy. While replacing one element with an established level is possible, replacing with a novel element will result in a warning.

``` r
factorExample <- factor(c("small","big","big","small"))
factorExample[1] <- c("big")
factorExample
```

```
## [1] big   big   big   small
## Levels: big small
```

``` r
factorExample[1] <- c("huge")
```

```
## Warning in `[<-.factor`(`*tmp*`, 1, value = "huge"): invalid factor level, NA
## generated
```

``` r
factorExample
```

```
## [1] <NA>  big   big   small
## Levels: big small
```

---
## Replacement

To add a new level we can use the **levels()** function.

``` r
levels(factorExample) <- c("big","small","huge")
factorExample[1] <- c("huge")
factorExample
```

```
## [1] huge  big   big   small
## Levels: big small huge
```
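
---
## Replacement

One caution when assigning to **levels()**: the new names replace the old ones by position, so listing existing levels in a different order relabels the data. The sketch below shows this, along with an alternative that adds a level by rebuilding the factor. All variable names are illustrative.

``` r
sizes <- factor(c("small","big","big","small"))
levels(sizes)                               # "big" "small"

# Swapping the order renames the levels, so the data change meaning
relabelled <- sizes
levels(relabelled) <- c("small","big")
as.character(relabelled)                    # "big" "small" "small" "big"

# Rebuilding the factor keeps the data and appends the new level
extended <- factor(sizes, levels=c(levels(sizes),"huge"))
extended[1] <- "huge"
extended
```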

---
class: inverse, center, middle

# Data Frames

---

## Creating data frames

[We saw that with matrices you can only have one type of data](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Data_types). We tried to create a matrix with a character element and the entire matrix became a character matrix.

In practice, we often want a table with a mixture of types (e.g. a table with sample names (character), sample type (factor) and survival time (numeric)).

---
## Creating data frames

In R, we make use of the **data frame** object which allows us to store tables with columns of different data types. To create a data frame we can simply use the **data.frame()** function.

``` r
patientName <- c("patient1","patient2","patient3","patient4")
patientType <- factor(rep(c("male","female"),2))
survivalTime <- c(1,30,2,20)
dfExample <- data.frame(Name=patientName, Type=patientType, Survival_Time=survivalTime)
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

---
## Indexing and replacement

Data frames may be indexed just [as matrices](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Indexing).

``` r
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

``` r
dfExample[dfExample[,"Survival_Time"] > 10,]
```

```
##       Name   Type Survival_Time
## 2 patient2 female            30
## 4 patient4 female            20
```
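
---
## Indexing and replacement

Row conditions can be combined with **&** and paired with a selection of columns in the same pair of square brackets. A sketch on a copy of the example table, here called `exampleDF` for illustration.

``` r
exampleDF <- data.frame(Name=c("patient1","patient2","patient3","patient4"),
                        Type=factor(rep(c("male","female"),2)),
                        Survival_Time=c(1,30,2,20))

# Rows: female patients surviving longer than 25
# Columns: only Name and Survival_Time
selected <- exampleDF[exampleDF[,"Type"] == "female" &
                        exampleDF[,"Survival_Time"] > 25,
                      c("Name","Survival_Time")]
selected
```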

---
## Specify columns with $

Unlike matrices, it is possible to index a column by using the **$** symbol.

``` r
dfExample <- data.frame(Name=patientName,Type=patientType,Survival_Time=survivalTime)
dfExample$Survival_Time
```

```
## [1]  1 30  2 20
```

``` r
dfExample[dfExample$Survival_Time < 10,]
```

```
##       Name Type Survival_Time
## 1 patient1 male             1
## 3 patient3 male             2
```

---
## Specify columns with $

Using **$** allows R to match a column from the start of its name (partial matching), and RStudio to autocomplete your selection, which can speed up coding.

``` r
dfExample$Surv
```

```
## [1]  1 30  2 20
```
But this will not work:

``` r
dfExample[,"Surv"]
```
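
---
## Specify columns with $

The difference comes down to partial versus exact matching of column names. The sketch below contrasts **$** with the double square brackets **[[ ]]**, which require the exact name by default. `exampleDF` is an illustrative name.

``` r
exampleDF <- data.frame(Survival_Time=c(1,30,2,20))

# $ accepts the start of a column name if it is unambiguous
partialMatch <- exampleDF$Surv
partialMatch

# No column is called exactly "Surv"
"Surv" %in% colnames(exampleDF)       # FALSE

# [[ ]] needs the exact name and returns NULL otherwise
exactMatch <- exampleDF[["Surv"]]
is.null(exactMatch)                   # TRUE
```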

---
## Create columns with $

The **$** operator also allows for the creation of new columns for a data frame on the fly.

``` r
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

``` r
dfExample$newColumn <- rep("newData",nrow(dfExample))
dfExample
```

```
##       Name   Type Survival_Time newColumn
## 1 patient1   male             1   newData
## 2 patient2 female            30   newData
## 3 patient3   male             2   newData
## 4 patient4 female            20   newData
```

---
## Indexing and replacement

Replacement can occur in the same way we have seen for less complex data types: use the assignment operator *<-*

``` r
dfExample[dfExample[,"Survival_Time"] < 10,"Survival_Time"] <- 0
dfExample
```

```
##       Name   Type Survival_Time newColumn
## 1 patient1   male             0   newData
## 2 patient2 female            30   newData
## 3 patient3   male             0   newData
## 4 patient4 female            20   newData
```

---
## Indexing and replacement

When we work with factors, a replacement is only successful if the new value is an existing level.

``` r
dfExample[1,"Type"] <- "other"
```

```
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "other"): invalid factor level,
## NA generated
```

``` r
dfExample
```

```
##       Name   Type Survival_Time newColumn
## 1 patient1   <NA>             0   newData
## 2 patient2 female            30   newData
## 3 patient3   male             0   newData
## 4 patient4 female            20   newData
```

---
## Factors in data frames

It is possible to update factors in data frames just as with standard factors.

``` r
dfExample <- data.frame(Name=patientName,Type=patientType,
                        Survival_Time=survivalTime)

levels(dfExample[,"Type"]) <- c(levels(dfExample[,"Type"]) ,
                                "other")

dfExample[1,"Type"] <- "other"
dfExample
```

```
##       Name   Type Survival_Time
## 1 patient1  other             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```
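
If you are not sure whether a value is already a level, you can test for it first. This is a small sketch on a standalone factor; `newValue` is a hypothetical category chosen for illustration.

```r
patientType <- factor(c("male", "female", "male", "female"))
newValue <- "other"   # hypothetical new category

# Add the level only if it is missing, then the replacement is safe.
if (!(newValue %in% levels(patientType))) {
  levels(patientType) <- c(levels(patientType), newValue)
}
patientType[1] <- newValue
patientType
```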

---
## Factors in data frames

If you are running an older version of R (anything from before 4.0), then by default all columns with characters will be considered to be factors.

This is controlled when you make the data frame, by the **stringsAsFactors** argument. Before R 4.0 the default for this argument was **TRUE**. It now defaults to **FALSE**.

``` r
dfExample <- data.frame(Name=patientName,
                        Type=patientType,
                        Survival_Time=survivalTime,
                        stringsAsFactors = T)

dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
```

```
## Warning in `[<-.factor`(`*tmp*`, iseq, value = c("patientX", "patientX")):
## invalid factor level, NA generated
```

``` r
dfExample
```

```
##       Name   Type Survival_Time
## 1     <NA>   male             1
## 2 patient2 female            30
## 3     <NA>   male             2
## 4 patient4 female            20
```

---
## Factors in data frames

``` r
dfExample <- data.frame(Name=patientName,
                        Type=patientType,
                        Survival_Time=survivalTime,
                        stringsAsFactors = F)

dfExample[dfExample[,"Survival_Time"] < 10,"Name"] <- "patientX"
dfExample
```

```
##       Name   Type Survival_Time
## 1 patientX   male             1
## 2 patient2 female            30
## 3 patientX   male             2
## 4 patient4 female            20
```

---
## Ordering with order()

A useful function in R is **order()**

For numeric vectors, **order()** by default returns the indices of a vector in that vector's increasing order. This behavior can be altered by using the "decreasing" argument passed to order.

``` r
testOrder <- c(20,100,45, 31)
testOrder
```

```
## [1]  20 100  45  31
```

``` r
order(testOrder)
```

```
## [1] 1 4 3 2
```

``` r
order(testOrder,decreasing=T)
```

```
## [1] 2 3 4 1
```

---
## Ordering with order()

Once you have a vector of ordered indices, you can then use this to index your object.

``` r
testOrder[order(testOrder)]
```

```
## [1]  20  31  45 100
```

``` r
testOrder[order(testOrder,decreasing=T)]
```

```
## [1] 100  45  31  20
```
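
Indexing a vector by its own order gives the same result as **sort()**. The advantage of **order()** is that the indices can also be used to rearrange something else. A short sketch, where `sampleNames` is a made-up companion vector:

```r
testOrder <- c(20, 100, 45, 31)
sampleNames <- c("a", "b", "c", "d")   # hypothetical labels, one per value

viaOrder <- testOrder[order(testOrder)]
viaSort  <- sort(testOrder)
identical(viaOrder, viaSort)

# Reorder the labels by the values they belong to.
orderedNames <- sampleNames[order(testOrder)]
orderedNames
```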

---
##  Ordering with NA values

When a vector contains NA values, these NA values will, by default, be placed last in the ordering indices. This can be controlled by the **na.last** argument.

``` r
testOrder <- c(2,1,NA,3)
testOrder[order(testOrder,decreasing=T,na.last=T)]
```

```
## [1]  3  2  1 NA
```

``` r
testOrder[order(testOrder,decreasing=T,na.last=F)]
```

```
## [1] NA  3  2  1
```

---
## Ordering data frames

Since the order() function returns the indices of a vector in sorted order, we can use it to order data frames by certain columns.

``` r
dfExample
```

```
##       Name   Type Survival_Time
## 1 patientX   male             1
## 2 patient2 female            30
## 3 patientX   male             2
## 4 patient4 female            20
```

``` r
dfExample[order(dfExample$Surv, decreasing=T),]
```

```
##       Name   Type Survival_Time
## 2 patient2 female            30
## 4 patient4 female            20
## 3 patientX   male             2
## 1 patientX   male             1
```

---
## Ordering data frames

We can also use order() to arrange a data frame by multiple columns by providing multiple vectors to the order() function. Ordering is performed in the order of the arguments, with later vectors used to break ties in earlier ones.

``` r
dfExample[order(dfExample$Type,
                dfExample$Survival,
                decreasing=T),]
```

```
##       Name   Type Survival_Time
## 3 patientX   male             2
## 1 patientX   male             1
## 2 patient2 female            30
## 4 patient4 female            20
```
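
In the example above both columns are sorted in decreasing order. If you need a different direction per column, one option is to pass a vector to **decreasing** together with `method = "radix"`. The sketch below rebuilds the data frame so it runs on its own; `mixedIdx` is just an illustrative name.

```r
dfExample <- data.frame(Name = paste0("patient", 1:4),
                        Type = factor(rep(c("male", "female"), 2)),
                        Survival_Time = c(1, 30, 2, 20))

# Type in decreasing order, Survival_Time in increasing order within each Type.
mixedIdx <- order(dfExample$Type, dfExample$Survival_Time,
                  decreasing = c(TRUE, FALSE), method = "radix")
dfExample[mixedIdx, ]
```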

---
## Merging data frames

A common operation is to join two data frames by a column of common values.

``` r
dfExample <- data.frame(Name=patientName,Type=patientType,
                        Survival_Time=survivalTime)
dfExample 
```

```
##       Name   Type Survival_Time
## 1 patient1   male             1
## 2 patient2 female            30
## 3 patient3   male             2
## 4 patient4 female            20
```

``` r
dfExample2 <- data.frame(Name=patientName[1:3],
                        height=c(6.1,5.1,5.5))
dfExample2
```

```
##       Name height
## 1 patient1    6.1
## 2 patient2    5.1
## 3 patient3    5.5
```

---
## Merging data frames

To do this we can use the **merge()** function with the data frames as the first two arguments. We can then specify the columns to merge by with the **by** argument. To keep only data pertaining to values common to both data frames the **all** argument is set to FALSE.

``` r
mergedDF <- merge(dfExample,dfExample2,by=1,all=F)
mergedDF
```

```
##       Name   Type Survival_Time height
## 1 patient1   male             1    6.1
## 2 patient2 female            30    5.1
## 3 patient3   male             2    5.5
```
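
Setting **all** to TRUE keeps rows that appear in only one of the data frames and fills the gaps with NA. A minimal sketch with two small, self-contained data frames (`dfA` and `dfB` are illustrative names):

```r
dfA <- data.frame(Name = c("patient1", "patient2", "patient3", "patient4"),
                  Survival_Time = c(1, 30, 2, 20))
dfB <- data.frame(Name = c("patient1", "patient2", "patient3"),
                  height = c(6.1, 5.1, 5.5))

innerDF <- merge(dfA, dfB, by = "Name", all = FALSE)  # only shared names
outerDF <- merge(dfA, dfB, by = "Name", all = TRUE)   # every name, NA where missing
outerDF
```

The **all.x** and **all.y** arguments do the same for just the first or just the second data frame.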

---
## Time for an exercise!

Exercise on data frames can be found [here](../../exercises/exercises/FactorsAndDataframes_exercise.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/FactorsAndDataframes_answers.html)

---
class: inverse, center, middle

# Lists

---

## Creating lists

Lists are the final data type we will look at.

In R, lists provide a general container which may hold any data types of unequal lengths as part of its elements.

``` r
firstElement <- c(1,2,3,4)
secondElement <- matrix(1:10,nrow=2,ncol=5)
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))
```

---
## Creating lists
To create a list we can simply use the **list()** function with arguments specifying the data we wish to include in the list.

``` r
myList <- list(firstElement,secondElement,thirdElement)
myList
```

```
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
## 
## [[3]]
##   colOne colTwo
## 1      1    One
## 2      2    Two
## 3      4  Three
## 4      5   Four
```

---
## Named lists
 
[Just as with vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Character_vectors), list elements can be assigned names.

``` r
myNamedList <- list(First=firstElement,Second=secondElement,
                    Third=thirdElement)
myNamedList
```

```
## $First
## [1] 1 2 3 4
## 
## $Second
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
## 
## $Third
##   colOne colTwo
## 1      1    One
## 2      2    Two
## 3      4  Three
## 4      5   Four
```

---
## Indexing

Lists, as [with other data types in R](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Indexing), can be indexed. In contrast to other types, using **[]** on a list will subset the list to another list of the selected indices. To retrieve an element from a list in R, two square brackets **[[]]** must be used.

``` r
myList <- list(firstElement,secondElement,thirdElement)
myList[1]
```

```
## [[1]]
## [1] 1 2 3 4
```

``` r
myList[[1]]
```

```
## [1] 1 2 3 4
```

As with data frames, the **$** sign may be used to extract named elements from a list.

``` r
myNamedList$First
```

```
## [1] 1 2 3 4
```
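
A quick way to see the difference between the two kinds of brackets is to check the class of what comes back. A small sketch with a freshly built list:

```r
myNamedList <- list(First = c(1, 2, 3, 4),
                    Second = matrix(1:10, nrow = 2, ncol = 5))

singleBracket <- myNamedList["First"]     # still a list, of length 1
doubleBracket <- myNamedList[["First"]]   # the numeric vector itself
class(singleBracket)
class(doubleBracket)

# Once an element is extracted it can be indexed as usual.
cellValue <- myNamedList[["Second"]][2, 3]
cellValue
```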

---
## Joining lists

Again, [similar to vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/presentations/singlepage/introToR_Session1.html#Vectors25), lists can be joined together in R using the c() function.

``` r
myNamedList <- list(First=firstElement,Second=secondElement,
                    Third=thirdElement)
myNamedList <- c(myNamedList,list(fourth=c(4,4)))
myNamedList[c(1,4)]
```

```
## $First
## [1] 1 2 3 4
## 
## $fourth
## [1] 4 4
```

---
## Joining vectors to lists

Note that on the last slide we joined two lists. If we join a vector to a list, every element of the vector becomes a separate list element.

``` r
myList <- c(myList,c(4,4))
myList[3:5]
```

```
## [[1]]
##   colOne colTwo
## 1      1    One
## 2      2    Two
## 3      4  Three
## 4      5   Four
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 4
```
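
To keep the vector together as a single element, wrap it in **list()** before joining. A minimal comparison, using a small illustrative list:

```r
smallList <- list(c(1, 2, 3, 4), "a")

spreadOut <- c(smallList, c(4, 4))         # each value becomes its own element
keptWhole <- c(smallList, list(c(4, 4)))   # the vector stays as one element

length(spreadOut)
length(keptWhole)
keptWhole[[3]]
```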

---
## Flattening lists

Sometimes you will wish to "flatten" out a list. When a list contains compatible objects, i.e. a list where all elements are of one type, the **unlist()** function can be used. Note that the names are kept, with additional numeric suffixes.

``` r
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
myNamedList
```

```
## $First
## [1] 1 2 3
## 
## $Second
## [1] 2 6 7
## 
## $Third
## [1] 1 4 7
```

``` r
flatList <- unlist(myNamedList)
flatList[1:7]
```

```
##  First1  First2  First3 Second1 Second2 Second3  Third1 
##       1       2       3       2       6       7       1
```

---
## Flattening lists to matrices

A common step is to turn a list of standard results into a matrix. This can be done in a few steps in R, using functions you are now familiar with.

``` r
myNamedList <- list(First=c(1,2,3),Second=c(2,6,7),Third=c(1,4,7))
flatList <- unlist(myNamedList)
listAsMat <- matrix(flatList,
                    nrow=length(myNamedList),
                    ncol=3,
                    byrow=T,
                    dimnames=list(names(myNamedList)))
listAsMat
```

```
##        [,1] [,2] [,3]
## First     1    2    3
## Second    2    6    7
## Third     1    4    7
```
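
When every element has the same length, an alternative is to bind the elements together as rows with **do.call()** and **rbind()**. This is a sketch of that shortcut, not a replacement for the steps above; `viaRbind` and `viaMatrix` are illustrative names.

```r
myNamedList <- list(First = c(1, 2, 3), Second = c(2, 6, 7), Third = c(1, 4, 7))

# rbind() is called with each list element as one argument, so each becomes a row.
viaRbind <- do.call(rbind, myNamedList)

# The step-by-step version from the slide, for comparison.
viaMatrix <- matrix(unlist(myNamedList), nrow = length(myNamedList),
                    ncol = 3, byrow = TRUE,
                    dimnames = list(names(myNamedList)))
viaRbind
```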

---
## Time for an exercise!

Exercise on lists can be found [here](../../exercises/exercises/Lists_exercise.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/Lists_answers.html)

---
class: inverse, center, middle

# Coercing data classes

---

## Identifying an object's class

We can use the class() function to tell us the class of our variable or object.

``` r
class(namedMatrix)
```

```
## [1] "matrix" "array"
```

``` r
class(dfExample)
```

```
## [1] "data.frame"
```

---
## Coercing data class
Sometimes you may want to cast an object into a new class. There is a group of functions to coerce your data object to a new format, **as._desiredClass_()**.

``` r
namedMatrix
```

```
##       Column_1 Column_2 Column_3 Column_4 Column_5
## Row_1        1        3        5        7        9
## Row_2        2        4        6        8       10
```

``` r
as.character(namedMatrix)
```

```
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
```

``` r
as.vector(namedMatrix)
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

---
## Coercing data class

``` r
as.data.frame(namedMatrix)
```

```
##       Column_1 Column_2 Column_3 Column_4 Column_5
## Row_1        1        3        5        7        9
## Row_2        2        4        6        8       10
```

``` r
as.list(namedMatrix)
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8
## 
## [[9]]
## [1] 9
## 
## [[10]]
## [1] 10
```
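
Coercion does not always do what you might expect, so it is worth checking the result. Two common surprises are sketched below with made-up values: factors are converted to their internal level codes, and text that is not a number becomes NA.

```r
scores <- factor(c("10", "20", "10"))

levelCodes <- as.numeric(scores)                 # the level codes, not the labels
trueValues <- as.numeric(as.character(scores))   # convert the labels instead

# Text that cannot be read as a number becomes NA (with a warning).
notANumber <- suppressWarnings(as.numeric("abc"))

levelCodes
trueValues
notANumber
```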

---
## Objects

All these common data classes we have covered so far are examples of R objects.

As shown we can assess the type of object using the **class()** function.

``` r
class(dfExample)
```

```
## [1] "data.frame"
```

---
## Working with objects

We have also seen that functions act differently with the different data types.

``` r
summary(namedMatrix)
```

```
##     Column_1       Column_2       Column_3       Column_4       Column_5    
##  Min.   :1.00   Min.   :3.00   Min.   :5.00   Min.   :7.00   Min.   : 9.00  
##  1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25   1st Qu.:7.25   1st Qu.: 9.25  
##  Median :1.50   Median :3.50   Median :5.50   Median :7.50   Median : 9.50  
##  Mean   :1.50   Mean   :3.50   Mean   :5.50   Mean   :7.50   Mean   : 9.50  
##  3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75   3rd Qu.:7.75   3rd Qu.: 9.75  
##  Max.   :2.00   Max.   :4.00   Max.   :6.00   Max.   :8.00   Max.   :10.00
```

``` r
summary(dfExample)
```

```
##      Name               Type   Survival_Time  
##  Length:4           female:2   Min.   : 1.00  
##  Class :character   male  :2   1st Qu.: 1.75  
##  Mode  :character              Median :11.00  
##                                Mean   :13.25  
##                                3rd Qu.:22.50  
##                                Max.   :30.00
```

---
## Working with objects

In fact simple interactions, such as indexing, with the different data types can be very different.

``` r
namedMatrix[,1]
```

```
## Row_1 Row_2 
##     1     2
```

``` r
dfExample$Type
```

```
## [1] male   female male   female
## Levels: female male
```

---
## Working with objects

To learn how to work with the different data types we can use the help functions built into R.

``` r
?data.frame
```

---
## More complex objects

As we progress through using R in our day to day analysis we will want to take advantage of more complex objects available in either Base R or in one of the many extension packages.

By using pre-established objects we gain access to the routines and functions built for that object's special use cases.

An example of a more complex object type is that which helps us manage Dates or Times.

``` r
Time <- Sys.time()
Time
```

```
## [1] "2025-08-11 07:17:05 UTC"
```

---
## More complex objects

As you can see this object shows the local date and time in a human readable way.

The object, however, is not one of our standard data types but a new data type called **POSIXct**.

``` r
class(Time)
```

```
## [1] "POSIXct" "POSIXt"
```

---
## More complex objects

Let's use the help function to learn more about this object.

``` r
?POSIXct
```

---
## More complex objects

We can see from the help that we can use comparison operators, such as **>**, with our POSIXct object.

``` r
TimeNow <- Sys.time()
TimeNow > Time
```

```
## [1] TRUE
```

---
## More complex objects

We can also use arithmetic operations with our time objects. Adding or subtracting a number shifts the time by that many seconds.

``` r
Time
```

```
## [1] "2025-08-11 07:17:05 UTC"
```

``` r
Time - 120
```

```
## [1] "2025-08-11 07:15:05 UTC"
```

``` r
TimeNow - Time
```

```
## Time difference of 0.1045375 secs
```
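
Subtracting two times gives a *difftime* object whose units are chosen automatically, so the same subtraction may report seconds, minutes or hours. The sketch below uses two fixed, made-up times so the result is reproducible, and asks for the units explicitly with **difftime()**.

```r
startTime <- as.POSIXct("2025-08-11 07:15:05", tz = "UTC")
endTime   <- as.POSIXct("2025-08-11 07:17:05", tz = "UTC")

autoDiff <- endTime - startTime                       # units picked by R
secsDiff <- difftime(endTime, startTime, units = "secs")

units(autoDiff)
as.numeric(autoDiff)
as.numeric(secsDiff)
```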

---
## More complex objects

Most objects will have a default display defined by a print method.

We can control how date time POSIXct objects are displayed by using the **format()** method.

We can also change the timezone by specifying a **tz** parameter.

``` r
format(Time,format="%H O'Clock %p %A on %B %dth")
```

```
## [1] "07 O'Clock AM Monday on August 11th"
```

``` r
format(Time,format="%H O'Clock %p %A on %B %dth",tz = "GMT")
```

```
## [1] "07 O'Clock AM Monday on August 11th"
```
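
The two outputs above are the same because the time was already recorded in UTC, which matches GMT. The effect of **tz** is easier to see with a timezone that has an offset. A small sketch, assuming a fixed time and that the "America/New_York" timezone name is available on your system:

```r
fixedTime <- as.POSIXct("2025-08-11 07:17:05", tz = "UTC")

utcClock <- format(fixedTime, format = "%H:%M", tz = "UTC")
nycClock <- format(fixedTime, format = "%H:%M", tz = "America/New_York")

utcClock
nycClock
```

The underlying moment in time is unchanged; only the way it is displayed differs.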

---
## Complex to base objects

Most of the time we can convert more complex objects back to the basic object types we are more familiar with.

``` r
as.character(Time)
```

```
## [1] "2025-08-11 07:17:05.870341"
```

``` r
as.numeric(TimeNow-Time)
```

```
## [1] 0.1045375
```

---
## Complex objects summary

In future courses we will come across objects which help us handle data such as that from BAM files, FastQ files and single cell experiments.

How to interact with these objects will be defined in their help pages.

By using these specialized objects we will be able to focus on understanding and analyzing the data contained within the object instead of how to format and interact with such complex data.

---
class: inverse, center, middle

# Reading and Writing Data

---

## Working Directory

Most of the time, you will not be generating data in R but will be importing data from external files. Before you can read in data, you need to know from which vantage point your R console is looking. This viewpoint is the Working Directory.

![location](imgs/youarehere.png)

---
## Working Directory

You can check your working directory using the *getwd()* function.

``` r
getwd()
```

```
## [1] "/__w/Intro_To_R_1Day/Intro_To_R_1Day/extdata"
```

```
[1] "/Users/mattpaul"
```

---
## Setting the Working directory

Your current working directory is typically your *HOME* directory. When we work on a project we typically want to keep everything in one place, i.e. input files, output files, scripts etc. It is good practice to set our working directory to that place too.

We will do this in our downloaded course material, so everyone is in the same place.

**Session -> Set Working Directory -> Choose Directory**

or in the console.

``` r
setwd("/PathToMyDownload/Intro_To_R_1Day-master/r_course")
# e.g. setwd("/Users/mattpaul/Downloads/Intro_To_R_1Day/r_course")
```

---
## Paths

When you give a path to R it can either be relative or absolute.

![location](imgs/youarehere2.png)

---
## Paths

To use our map analogy:

- Relative paths are like directions, i.e. take a left, go straight then take a right etc. The context of where you start is essential.

- Absolute paths are like an address. They give the final location in absence of any other external information.

Both have their benefits.

---

## Paths in use

The command we used before was using an absolute path. Typically they start with "/" to get to the top level of your computer's file structure.

``` r
setwd("/Users/mattpaul/Downloads/Intro_To_R_1Day-master/r_course/")
```

Given that we started at: */Users/mattpaul*, we could have also used the path below. This uses the knowledge of our start position to specify where we are going.

``` r
setwd("Downloads/Intro_To_R_1Day-master/r_course/")
```
```
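
Paths can also be built from their parts with **file.path()**, which avoids mistakes with separators. The sketch below only builds the strings and does not change the working directory; the folder names follow the example above and will differ on your computer.

```r
# A relative path: only meaningful from a known starting directory.
relPath <- file.path("Downloads", "Intro_To_R_1Day-master", "r_course")

# An absolute path: the starting directory is spelled out in full.
absPath <- file.path("/Users/mattpaul", relPath)

relPath
absPath
```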

---
## Data from external sources

Now we have set our location we can start thinking about reading in data. A standard format for data is a table:

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Gene_Name </th>
   <th style="text-align:right;"> Sample_1.hi </th>
   <th style="text-align:right;"> Sample_2.hi </th>
   <th style="text-align:right;"> Sample_3.hi </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Gene_a </td>
   <td style="text-align:right;"> 3.498335 </td>
   <td style="text-align:right;"> 2.579126 </td>
   <td style="text-align:right;"> 2.006222 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_b </td>
   <td style="text-align:right;"> 3.533382 </td>
   <td style="text-align:right;"> 4.883157 </td>
   <td style="text-align:right;"> 2.333487 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_c </td>
   <td style="text-align:right;"> 5.716186 </td>
   <td style="text-align:right;"> 4.419684 </td>
   <td style="text-align:right;"> 3.994441 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_d </td>
   <td style="text-align:right;"> 3.760701 </td>
   <td style="text-align:right;"> 5.558142 </td>
   <td style="text-align:right;"> 3.611492 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_e </td>
   <td style="text-align:right;"> 9.769188 </td>
   <td style="text-align:right;"> 9.280144 </td>
   <td style="text-align:right;"> 8.945527 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_f </td>
   <td style="text-align:right;"> 9.557602 </td>
   <td style="text-align:right;"> 11.732771 </td>
   <td style="text-align:right;"> 8.218103 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_g </td>
   <td style="text-align:right;"> 9.484837 </td>
   <td style="text-align:right;"> 11.517326 </td>
   <td style="text-align:right;"> 8.910251 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gene_h </td>
   <td style="text-align:right;"> 11.133336 </td>
   <td style="text-align:right;"> 9.423452 </td>
   <td style="text-align:right;"> 10.081443 </td>
  </tr>
</tbody>
</table>

---
## Data from text file with read.table()

Tables from text files can be read with **read.table()** function. We are providing a relative path to our file as the first argument.

``` r
Table <- read.table("data/readThisTable.csv",sep=",",header=T)
Table[1:4,1:3]
```

```
##   Gene_Name Sample_1.hi Sample_2.hi
## 1    Gene_a    4.570237    3.230467
## 2    Gene_b    3.561733    3.632285
## 3    Gene_c    3.797274    2.874462
## 4    Gene_d    3.398242    4.415202
```

Here we have provided two additional arguments. 
- **sep** argument specifies how columns are separated in our text file. ("," for .csv, "\t" for .tsv)
- **header** argument specifies whether columns have headers.
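
To try these arguments without the course files, the sketch below writes a tiny made-up table to a temporary file and reads it back. It also uses **read.csv()**, a wrapper around read.table() with comma-separated defaults, which should give the same table here.

```r
# Write a small, made-up csv to a temporary file.
tmpFile <- tempfile(fileext = ".csv")
writeLines(c("Gene_Name,Sample_1.hi,Sample_2.hi",
             "Gene_a,4.57,3.23",
             "Gene_b,3.56,3.63"), tmpFile)

viaTable <- read.table(tmpFile, sep = ",", header = TRUE)
viaCsv   <- read.csv(tmpFile)   # sep = "," and header = TRUE are the defaults

viaTable
all.equal(viaTable, viaCsv)
```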

---
## Row names in read.table()

read.table() allows for significant control over reading files through its many arguments. Have a look at options by using **?read.table**

The **row.names** argument can be used to specify a column to use as row names for the resulting data frame. Here we use the first column as row names.

``` r
Table <- read.table("data/readThisTable.csv",sep=",",header=T,row.names=1)
Table[1:4,1:3]
```

```
##        Sample_1.hi Sample_2.hi Sample_3.hi
## Gene_a    4.570237    3.230467    3.351827
## Gene_b    3.561733    3.632285    3.587523
## Gene_c    3.797274    2.874462    4.016916
## Gene_d    3.398242    4.415202    4.893561
```

---
## Setting factors from read.table()

As mentioned, data which is read into R through read.table() will be of data frame class.

Similar to when we create data frames, we can control whether the data is read in as a factor or not with the **stringsAsFactors** argument. The default of this has also changed in R 4.0, from **TRUE** to **FALSE**.

``` r
Table <- read.table("data/readThisTable.csv", sep=",", header=T, stringsAsFactors=F)
```

Other very useful arguments for read.table() include:
- **skip** - To set number of lines to skip when reading.
- **comment.char** - To set the start identifier for lines not to be read.
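
A minimal sketch of both arguments, assuming a made-up file that starts with a title line and contains comment lines marked with "#":

```r
tmpFile <- tempfile(fileext = ".csv")
writeLines(c("Exported by a hypothetical instrument",   # title line to skip
             "# a commented line",
             "Gene_Name,Score",
             "Gene_a,1.5",
             "# another commented line",
             "Gene_b,2.5"), tmpFile)

# Skip the first line, and ignore anything starting with "#".
commentedTable <- read.table(tmpFile, sep = ",", header = TRUE,
                             skip = 1, comment.char = "#")
commentedTable
```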

---
## Data from other sources

The read.table() function can also read data directly from a URL.

``` r
URL <- "https://raw.githubusercontent.com/RockefellerUniversity/Intro_To_R_1Day/refs/heads/master/readThisTable.csv"
Table <- read.table(URL,sep=",",header=T)
Table[1:2,1:3]
```

```
##   Gene_Name Sample_1.hi Sample_2.hi
## 1    Gene_a    4.570237    3.230467
## 2    Gene_b    3.561733    3.632285
```

And from the clipboard (this is the Windows version).

``` r
Table <- read.table(file="clipboard",sep=",",header=T)
```

---
## Data from file columns

The read.table() function will, by default, read every row and column of a file.

The **scan()** function allows for the selection of particular columns to be read into R and so can save memory when files are large.

``` r
x <- scan("data/readThisTable.csv",sep=",",
          what = as.list(c("character",rep("numeric", 6))),skip=1)
x[1:3]
```

```
## [[1]]
## [1] "Gene_a" "Gene_b" "Gene_c" "Gene_d" "Gene_e" "Gene_f" "Gene_g" "Gene_h"
## 
## [[2]]
## [1] "4.57023720364456" "3.56173302139372" "3.79727358461183" "3.39824234540912"
## [5] "10.1287867100999" "8.4743967992122"  "10.0100200884644" "9.39999241674877"
## 
## [[3]]
## [1] "3.23046698308814" "3.63228532632679" "2.87446166873403" "4.41520211046494"
## [5] "10.2240711640563" "8.61262799641256" "10.3123540206195" "10.3328437472096"
```
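
In the output above the numbers are shown in quotes because **scan()** takes the type of each element of **what**, not its text, so the string "numeric" is treated as a character template. The sketch below, using a small made-up file, passes typed templates instead and uses NULL to skip a column altogether.

```r
tmpFile <- tempfile(fileext = ".csv")
writeLines(c("Gene_Name,Sample_1,Sample_2",
             "Gene_a,4.57,3.23",
             "Gene_b,3.56,3.63"), tmpFile)

# character() and numeric() set the type of each column; NULL skips the column.
selectedCols <- scan(tmpFile, sep = ",", skip = 1, quiet = TRUE,
                     what = list(character(), numeric(), NULL))
selectedCols[1:2]
```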

---
## Writing data to file

Once we have our data analysed in R, we will want to export it to a file.

The most common method is to use the write.table() function.

``` r
write.table(Table, file="data/writeThisTable.csv", sep=",")
```

Since our data has column names but no row names, we set the **col.names** and **row.names** arguments of write.table() accordingly.

``` r
write.table(Table, file="data/writeThisTable.csv", sep=",", row.names =F, col.names=T)
```
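
A good habit is to read a file back in after writing it and check nothing has changed. The sketch below does this round trip with a small made-up table and a temporary file, so it does not touch the course data.

```r
smallTable <- data.frame(Gene_Name = c("Gene_a", "Gene_b"),
                         Score = c(1.5, 2.5))

outFile <- tempfile(fileext = ".csv")
write.table(smallTable, file = outFile, sep = ",",
            row.names = FALSE, col.names = TRUE)

# Read the file back and compare it with what we wrote.
readBack <- read.table(outFile, sep = ",", header = TRUE)
all.equal(smallTable, readBack)
```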

---
## Reviewing your data
It is always important to know what your data looks like, especially when you are reading it in for the first time. We have used indexing to get a taste of the data frames so far, but there are two functions to quickly check your data: **head()** and **tail()** return the first or last 6 lines by default.

``` r
head(Table)
```

```
##   Gene_Name Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low
## 1    Gene_a    4.570237    3.230467    3.351827     3.930877     4.098247
## 2    Gene_b    3.561733    3.632285    3.587523     4.185287     1.380976
## 3    Gene_c    3.797274    2.874462    4.016916     4.175772     1.988263
## 4    Gene_d    3.398242    4.415202    4.893561     8.432342     9.609151
## 5    Gene_e   10.128787   10.224071    8.945813     2.936174     3.892924
## 6    Gene_f    8.474397    8.612628    7.170830     3.299351     2.575870
##   Sample_1.low
## 1    4.4187260
## 2    5.9369901
## 3    3.7809172
## 4    9.0198647
## 5    0.9192079
## 6    2.5427773
```
---
## Reviewing your data

``` r
tail(Table)
```

```
##   Gene_Name Sample_1.hi Sample_2.hi Sample_3.hi Sample_4.low Sample_5.low
## 3    Gene_c    3.797274    2.874462    4.016916     4.175772     1.988263
## 4    Gene_d    3.398242    4.415202    4.893561     8.432342     9.609151
## 5    Gene_e   10.128787   10.224071    8.945813     2.936174     3.892924
## 6    Gene_f    8.474397    8.612628    7.170830     3.299351     2.575870
## 7    Gene_g   10.010020   10.312354   11.603290     9.930704     7.748795
## 8    Gene_h    9.399992   10.332844    9.378217    10.065200    10.788619
##   Sample_1.low
## 3    3.7809172
## 4    9.0198647
## 5    0.9192079
## 6    2.5427773
## 7    9.7988238
## 8   10.2453258
```

``` r
head(Table,3)
```
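
The second argument sets how many lines are returned. Alongside **head()** and **tail()**, the base functions **dim()** and **str()** give a quick overview of size and column types. A small sketch on a made-up table (`geneTable` is an illustrative name):

```r
geneTable <- data.frame(Gene_Name = paste0("Gene_", letters[1:8]),
                        Score = 1:8)

firstThree <- head(geneTable, 3)   # first 3 rows
lastTwo    <- tail(geneTable, 2)   # last 2 rows

dim(geneTable)   # number of rows and columns
str(geneTable)   # type and first values of each column
```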

---
## The rio (R io) package

We may want to import from formats other than plain text.

We can make use of an R package (the rio package) which allows us to import and export data in multiple formats.

Formats include:

* XML.
* Matlab, SAS, SPSS and minitab output formats.
* Excel and OpenOffice formats.

---
## The rio package

To make use of the rio package functionality we will need to install this package to our version of R.

We can do this by using the **install.packages()** function with the package we wish to install.

**install.packages(_PACKAGENAME_)**

``` r
install.packages("rio")
```

---
## The rio package

Once we have installed a package, we will need to load it to make the functions available to us.

We can load a library by using the **library()** function with the package we wish to load.

**library(_PACKAGENAME_)**

``` r
library("rio")
```

---
## The rio package

The main two functions in the rio package are the **import** and **export** functions.

We can use the **import()** function to read in our csv file. We simply specify our file as an argument to the **import()** function.

**import(_Filename_)**

``` r
Table <- import("data/readThisTable.csv")
Table[1:2,]
```

---
## The rio package

When importing an Excel file with several sheets, by default we will only retrieve the first sheet.

We can specify the sheet by name or number using the **which** argument.

``` r
Table <- import("data/readThisXLS.xls", 
                which=2)
Table <- import("data/readThisXLS.xls", 
                which="Metadata")
Table[1:2,]
```

```
##       Patient Condition   Treatment
## 1 Sample_1.hi         A           X
## 2 Sample_2.hi         A NoTreatment
```

---
## The rio package

If we want to import all sheets, we can use the **import_list()** function.

This returns a *list* containing our two spreadsheets. The list has two elements named after the corresponding XLS sheet.

``` r
Table <- import_list("data/readThisXLS.xls")
names(Table)
```

```
## [1] "ExpressionScores" "Metadata"
```

---
## The rio package

Since this is a list of data.frames, we can access our sheets using standard list accessors, **$** and **[[]]**.

``` r
## Table[["ExpressionScores"]][1:2,]
Table$ExpressionScores[1:2,]
```

``` r
Table$Metadata[1:2,]
```

```
##       Patient Condition   Treatment
## 1 Sample_1.hi         A           X
## 2 Sample_2.hi         A NoTreatment
```

---
## The rio package

We can export our data back to file using the **export()** function and specifying the name of the output file to the **file** argument. The **export()** function will guess the format required from the extension.

``` r
ExpressionScores <- Table$ExpressionScores
export(ExpressionScores,file = "data/writeThisXLSX.xlsx")
```

---
## The rio package

We can export a list of data.frames to Excel's xlsx format using the **export()** function. The names of the list elements will be used to name the sheets in the xlsx file.

``` r
names(Table) <- c("expr","meta")
export(Table, file = "data/writeThisMultipleXLSX.xlsx")
```

---
## Save and read data

If you have an R object that is not rectangular, i.e. a list or a specialist R object, you can still save it.

Let's remake our list.

``` r
firstElement <- c(1,2,3,4)
secondElement <- matrix(1:10,nrow=2,ncol=5)
thirdElement <- data.frame(colOne=c(1,2,4,5),colTwo=c("One","Two","Three","Four"))

myList <- list(firstElement,secondElement,thirdElement)
myList
```

---
## Save and read data

We can use the *saveRDS()* and *readRDS()* functions to save and read in a single R object. The object is saved to a file with the *.rds* extension.

``` r
saveRDS(myList, "my_list.rds")
```

``` r
my_newlist <- readRDS("my_list.rds")
my_newlist
```

---
## Save and read data

There is also a *save()* function. This can be used to save multiple data objects into a single *RData* file. We have to ensure that we name the argument for file output (*file=*).

``` r
save(Table, myList, file = "my_list.RData")
```

We simply use *load()* to read it back in. We cannot assign the loaded data to a new variable as there are multiple objects. Instead, load() restores each object under its original name, i.e. Table or myList.

``` r
load("my_list.RData")
```
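
The two approaches can be compared side by side. The sketch below uses temporary files and a small made-up list so it runs anywhere; `restoredList` and `loadedNames` are illustrative names.

```r
smallList <- list(1:3, "a")

# saveRDS()/readRDS(): one object, and you choose the name when reading it back.
rdsFile <- tempfile(fileext = ".rds")
saveRDS(smallList, rdsFile)
restoredList <- readRDS(rdsFile)

# save()/load(): objects come back under the names they were saved with.
rdataFile <- tempfile(fileext = ".RData")
save(smallList, file = rdataFile)
rm(smallList)                     # remove it so we can see load() bring it back
loadedNames <- load(rdataFile)    # load() returns the names of what it restored
loadedNames
```

Because load() can overwrite objects that already exist with the same name, readRDS() is often the more predictable choice for a single object.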

---
## Time for an exercise!

Exercise on reading and writing data can be found [here](../../exercises/exercises/DataInputOutput_exercises.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/DataInputOutput_answers.html)