Factors and Data frames

These exercises are about the factors and data frames sections of Introduction to R.

Exercise 1 - Factors

Create a nominal factor called CellType containing: “DC1”,“DC1”,“DC1”,“NK”,“NK”,“Mono”,“Mono”,“DC2”,“NK”

## [1] DC1  DC1  DC1  NK   NK   Mono Mono DC2  NK  
## Levels: DC1 DC2 Mono NK
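
One possible approach is sketched here using base R's factor(); it is a minimal example, not necessarily the course's reference solution.

```r
# Build a nominal (unordered) factor from a character vector.
# Without a `levels` argument, the levels are the sorted unique values.
CellType <- factor(c("DC1", "DC1", "DC1", "NK", "NK", "Mono", "Mono", "DC2", "NK"))
CellType
```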

Modify the third position of CellType to “Neu” by first adding “Neu” to the levels of the factor.

## [1] DC1  DC1  Neu  NK   NK   Mono Mono DC2  NK  
## Levels: DC1 DC2 Mono NK Neu
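
A minimal sketch of one way to do this, assuming CellType was created as in the first task. A factor only accepts values that are among its levels, so the new level is appended before the assignment; assigning an unknown value would produce NA with a warning.

```r
CellType <- factor(c("DC1", "DC1", "DC1", "NK", "NK", "Mono", "Mono", "DC2", "NK"))

# Append the new level first, then assign it to the third element.
levels(CellType) <- c(levels(CellType), "Neu")
CellType[3] <- "Neu"
CellType
```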

Create CellType2 with the same entries, but directly specify the levels to include: “DC1”, “DC2”, “Mono”, “NK”, “Neu”, “Bcell”, “Tcell”.

## [1] DC1  DC1  DC1  NK   NK   Mono Mono DC2  NK  
## Levels: DC1 DC2 Mono NK Neu Bcell Tcell
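
As a sketch, the levels can be passed to factor() directly. Levels that do not occur in the data (here Neu, Bcell and Tcell) are kept and simply have a count of zero.

```r
# Specify the full set of allowed levels up front, in the desired order.
CellType2 <- factor(
  c("DC1", "DC1", "DC1", "NK", "NK", "Mono", "Mono", "DC2", "NK"),
  levels = c("DC1", "DC2", "Mono", "NK", "Neu", "Bcell", "Tcell")
)
CellType2
```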

Use combine, c(), to increase the length of CellType2 to include: “Neu”,“Neu”,“Bcell”,“DC1”

##  [1] DC1   DC1   DC1   NK    NK    Mono  Mono  DC2   NK    Neu   Neu   Bcell
## [13] DC1  
## Levels: DC1 DC2 Mono NK Neu Bcell Tcell
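
One hedged way to extend the factor is shown here. Calling c() directly on factors has behaved differently across R versions (older releases return the underlying integer codes), so this sketch combines the values as character strings and rebuilds the factor with the original levels. The name `extra` is only illustrative.

```r
CellType2 <- factor(
  c("DC1", "DC1", "DC1", "NK", "NK", "Mono", "Mono", "DC2", "NK"),
  levels = c("DC1", "DC2", "Mono", "NK", "Neu", "Bcell", "Tcell")
)

# Combine old and new values as characters, then restore the level set.
extra <- c("Neu", "Neu", "Bcell", "DC1")
CellType2 <- factor(c(as.character(CellType2), extra), levels = levels(CellType2))
CellType2
```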

Summarize the number of entries for each cell type.

summary(CellType2)

##   DC1   DC2  Mono    NK   Neu Bcell Tcell 
##     4     1     2     3     2     1     0

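Reorder the summary to alphabetical order.

# Rebuild the factor with its levels in alphabetical order.
# Assigning to levels() would only rename the levels and mislabel the counts.
CellType2 <- factor(CellType2, levels = c("Bcell","DC1", "DC2", "Mono", "Neu","NK","Tcell"))
summary(CellType2)

## Bcell   DC1   DC2  Mono   Neu    NK Tcell 
##     1     4     1     2     2     3     0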

Create an ordinal factor named “Height” containing: high, low, mid, low, mid, low, mid, high, mid, high.

##  [1] high low  mid  low  mid  low  mid  high mid  high
## Levels: low < mid < high
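
A minimal sketch: an ordinal factor needs both an explicit level order and ordered = TRUE. Without the levels argument, the levels would be sorted alphabetically (high, low, mid), which is not the intended ranking.

```r
# Levels are listed from smallest to largest.
Height <- factor(
  c("high", "low", "mid", "low", "mid", "low", "mid", "high", "mid", "high"),
  levels = c("low", "mid", "high"),
  ordered = TRUE
)
Height
```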

Using a logical index, create a new factor of only those entries from “Height” greater than low.

## [1] high mid  mid  mid  high mid  high
## Levels: low < mid < high
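
Because the factor is ordered, comparison operators follow the level order, so a logical index can be built directly. This is a sketch; the name `aboveLow` is an assumption and does not come from the exercise.

```r
Height <- factor(
  c("high", "low", "mid", "low", "mid", "low", "mid", "high", "mid", "high"),
  levels = c("low", "mid", "high"),
  ordered = TRUE
)

# Height > "low" is a logical vector, TRUE wherever the entry ranks above low.
aboveLow <- Height[Height > "low"]
aboveLow
```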

Replace the last element of “Height” with veryHigh and create a new factor with those entries greater than mid.

## [1] high     high     veryHigh
## Levels: low < mid < high < veryHigh
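
One possible approach, sketched under the assumption that veryHigh should rank above high: rebuild the ordered factor with the extra level, assign it to the last element, then subset. The name `aboveMid` is illustrative.

```r
Height <- factor(
  c("high", "low", "mid", "low", "mid", "low", "mid", "high", "mid", "high"),
  levels = c("low", "mid", "high"),
  ordered = TRUE
)

# Add the new top level, keeping the factor ordered.
Height <- factor(Height, levels = c(levels(Height), "veryHigh"), ordered = TRUE)
Height[length(Height)] <- "veryHigh"

aboveMid <- Height[Height > "mid"]
aboveMid
```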

Exercise 2 - Data frames

Create a data frame called Annotation with a column of gene names (“Gene_1”, “Gene_2”, “Gene_3”, “Gene_4”, “Gene_5”), ensembl gene names (“Ens001”, “Ens003”, “Ens006”, “Ens007”, “Ens010”), pathway information (“Glycolysis”, “TGFb”, “Glycolysis”, “TGFb”, “Glycolysis”) and gene lengths (100, 3000, 200, 1000, 1200).

##   geneNames ensembl    pathway geneLengths
## 1    Gene_1  Ens001 Glycolysis         100
## 2    Gene_2  Ens003       TGFb        3000
## 3    Gene_3  Ens006 Glycolysis         200
## 4    Gene_4  Ens007       TGFb        1000
## 5    Gene_5  Ens010 Glycolysis        1200
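
A sketch of one way to build this data frame, with column names taken from the output above. stringsAsFactors = FALSE is set explicitly so the text columns stay as character vectors regardless of the R version.

```r
Annotation <- data.frame(
  geneNames   = c("Gene_1", "Gene_2", "Gene_3", "Gene_4", "Gene_5"),
  ensembl     = c("Ens001", "Ens003", "Ens006", "Ens007", "Ens010"),
  pathway     = c("Glycolysis", "TGFb", "Glycolysis", "TGFb", "Glycolysis"),
  geneLengths = c(100, 3000, 200, 1000, 1200),
  stringsAsFactors = FALSE
)
Annotation
```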

Filter Annotation to geneLengths that are greater than 500 and less than 2000. Use the dollar sign to extract column information.

##   geneNames ensembl    pathway geneLengths
## 4    Gene_4  Ens007       TGFb        1000
## 5    Gene_5  Ens010 Glycolysis        1200
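
A minimal sketch of the filter, assuming Annotation is built as described in the task. The two conditions are combined with the element-wise &, and the result is used as a row index. The name `midLength` is illustrative.

```r
Annotation <- data.frame(
  geneNames   = paste0("Gene_", 1:5),
  ensembl     = c("Ens001", "Ens003", "Ens006", "Ens007", "Ens010"),
  pathway     = c("Glycolysis", "TGFb", "Glycolysis", "TGFb", "Glycolysis"),
  geneLengths = c(100, 3000, 200, 1000, 1200),
  stringsAsFactors = FALSE
)

# Keep rows whose length is strictly between 500 and 2000; keep all columns.
midLength <- Annotation[Annotation$geneLengths > 500 & Annotation$geneLengths < 2000, ]
midLength
```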

Check the data type of each column. Update the pathway column to be a factor.

## [1] "character"

## [1] "character"

## [1] "character"

## [1] "numeric"

##   geneNames ensembl    pathway geneLengths
## 1    Gene_1  Ens001 Glycolysis         100
## 2    Gene_2  Ens003       TGFb        3000
## 3    Gene_3  Ens006 Glycolysis         200
## 4    Gene_4  Ens007       TGFb        1000
## 5    Gene_5  Ens010 Glycolysis        1200

## [1] "factor"
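
One hedged way to do this is sketched here. The output above comes from checking each column separately with class(); sapply() is used here as a compact alternative that checks all columns at once. The name `columnTypes` is illustrative.

```r
Annotation <- data.frame(
  geneNames   = paste0("Gene_", 1:5),
  ensembl     = c("Ens001", "Ens003", "Ens006", "Ens007", "Ens010"),
  pathway     = c("Glycolysis", "TGFb", "Glycolysis", "TGFb", "Glycolysis"),
  geneLengths = c(100, 3000, 200, 1000, 1200),
  stringsAsFactors = FALSE
)

# Class of every column before the conversion.
columnTypes <- sapply(Annotation, class)
columnTypes

# Overwrite the pathway column with a factor version of itself.
Annotation$pathway <- factor(Annotation$pathway)
class(Annotation$pathway)
```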

Create a data frame called Sample1 with ensembl gene names (“Ens001”, “Ens003”, “Ens006”, “Ens010”) and expression (1000, 3000, 10000, 5000).

##   ensembl expression
## 1  Ens001       1000
## 2  Ens003       3000
## 3  Ens006      10000
## 4  Ens010       5000

Create a data frame called Sample2 with ensembl gene names (“Ens001”, “Ens003”, “Ens006”, “Ens007”, “Ens010”) and expression (1500, 1500, 17000, 500, 10000).

##   ensembl expression
## 1  Ens001       1500
## 2  Ens003       1500
## 3  Ens006      17000
## 4  Ens007        500
## 5  Ens010      10000

Create a data frame containing only those genes common to all three data frames, with all information from Annotation and the expression from Sample1 and Sample2.

##   ensembl geneNames    pathway geneLengths expression.x expression.y
## 1  Ens001    Gene_1 Glycolysis         100         1000         1500
## 2  Ens003    Gene_2       TGFb        3000         3000         1500
## 3  Ens006    Gene_3 Glycolysis         200        10000        17000
## 4  Ens010    Gene_5 Glycolysis        1200         5000        10000
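
A sketch of one approach using two nested calls to base R's merge(). By default merge() keeps only rows whose key appears in both inputs, so Ens007 (absent from Sample1) drops out. Both samples have a column called expression, which is why the second merge adds the .x and .y suffixes. The name `Merged` is an assumption; the exercise does not name the result.

```r
Annotation <- data.frame(
  geneNames   = paste0("Gene_", 1:5),
  ensembl     = c("Ens001", "Ens003", "Ens006", "Ens007", "Ens010"),
  pathway     = c("Glycolysis", "TGFb", "Glycolysis", "TGFb", "Glycolysis"),
  geneLengths = c(100, 3000, 200, 1000, 1200),
  stringsAsFactors = FALSE
)
Sample1 <- data.frame(
  ensembl    = c("Ens001", "Ens003", "Ens006", "Ens010"),
  expression = c(1000, 3000, 10000, 5000),
  stringsAsFactors = FALSE
)
Sample2 <- data.frame(
  ensembl    = c("Ens001", "Ens003", "Ens006", "Ens007", "Ens010"),
  expression = c(1500, 1500, 17000, 500, 10000),
  stringsAsFactors = FALSE
)

# Inner join on the shared ensembl column, one sample at a time.
Merged <- merge(merge(Annotation, Sample1, by = "ensembl"), Sample2, by = "ensembl")
Merged
```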

Order the new data frame by geneLengths, from biggest to smallest.

##   ensembl geneNames    pathway geneLengths expression.x expression.y
## 2  Ens003    Gene_2       TGFb        3000         3000         1500
## 4  Ens010    Gene_5 Glycolysis        1200         5000        10000
## 3  Ens006    Gene_3 Glycolysis         200        10000        17000
## 1  Ens001    Gene_1 Glycolysis         100         1000         1500
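
A minimal sketch of the sort. For brevity the merged table is typed in directly from the output above rather than re-merged, and the name `Merged` is an assumption. order() returns the row positions that sort the column, which are then used as a row index.

```r
# Merged table as printed above (hypothetical name).
Merged <- data.frame(
  ensembl      = c("Ens001", "Ens003", "Ens006", "Ens010"),
  geneNames    = c("Gene_1", "Gene_2", "Gene_3", "Gene_5"),
  pathway      = c("Glycolysis", "TGFb", "Glycolysis", "Glycolysis"),
  geneLengths  = c(100, 3000, 200, 1200),
  expression.x = c(1000, 3000, 10000, 5000),
  expression.y = c(1500, 1500, 17000, 10000),
  stringsAsFactors = FALSE
)

# Sort rows from longest to shortest gene.
Merged <- Merged[order(Merged$geneLengths, decreasing = TRUE), ]
Merged
```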

Add two extra columns containing the length-normalized expression for Sample1 and Sample2.

##   ensembl geneNames    pathway geneLengths expression.x expression.y
## 2  Ens003    Gene_2       TGFb        3000         3000         1500
## 4  Ens010    Gene_5 Glycolysis        1200         5000        10000
## 3  Ens006    Gene_3 Glycolysis         200        10000        17000
## 1  Ens001    Gene_1 Glycolysis         100         1000         1500
##   Sample1_lne Sample2_lne
## 2    1.000000    0.500000
## 4    4.166667    8.333333
## 3   50.000000   85.000000
## 1   10.000000   15.000000
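
The values above are consistent with dividing each expression column by geneLengths. A sketch under that assumption, with the merged table reduced to the columns this step needs and `Merged` as a hypothetical name:

```r
Merged <- data.frame(
  ensembl      = c("Ens003", "Ens010", "Ens006", "Ens001"),
  geneLengths  = c(3000, 1200, 200, 100),
  expression.x = c(3000, 5000, 10000, 1000),
  expression.y = c(1500, 10000, 17000, 1500),
  stringsAsFactors = FALSE
)

# Element-wise division; assigning to a new name with $ appends a column.
Merged$Sample1_lne <- Merged$expression.x / Merged$geneLengths
Merged$Sample2_lne <- Merged$expression.y / Merged$geneLengths
Merged
```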

Identify the mean length-normalized expression across Sample1 and Sample2 for the gene Ens006.

## [1] 67.5
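
One possible way to get this value, sketched with a reduced version of the merged table (`Merged`, `ens006` and `meanEns006` are illustrative names): select the Ens006 row and the two normalized columns, then average them.

```r
Merged <- data.frame(
  ensembl     = c("Ens003", "Ens010", "Ens006", "Ens001"),
  Sample1_lne = c(3000, 5000, 10000, 1000) / c(3000, 1200, 200, 100),
  Sample2_lne = c(1500, 10000, 17000, 1500) / c(3000, 1200, 200, 100),
  stringsAsFactors = FALSE
)

# One-row data frame with the two normalized values for Ens006.
ens006 <- Merged[Merged$ensembl == "Ens006", c("Sample1_lne", "Sample2_lne")]

# unlist() flattens the row to a numeric vector so mean() can be applied.
meanEns006 <- mean(unlist(ens006))
meanEns006
```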

For all genes, identify the log2 fold change in length-normalized expression from Sample1 to Sample2.

##     Gene_2     Gene_5     Gene_3     Gene_1 
## -1.0000000  1.0000000  0.7655347  0.5849625
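
A sketch of the calculation, taking the fold change as Sample2 divided by Sample1, which matches the signs in the output above. The names `Merged` and `log2FC` are assumptions. Since both samples are divided by the same gene length, the ratio equals the ratio of the raw expression values.

```r
Merged <- data.frame(
  geneNames   = c("Gene_2", "Gene_5", "Gene_3", "Gene_1"),
  Sample1_lne = c(3000, 5000, 10000, 1000) / c(3000, 1200, 200, 100),
  Sample2_lne = c(1500, 10000, 17000, 1500) / c(3000, 1200, 200, 100),
  stringsAsFactors = FALSE
)

# Positive values mean higher expression in Sample2 than in Sample1.
log2FC <- log2(Merged$Sample2_lne / Merged$Sample1_lne)
names(log2FC) <- Merged$geneNames
log2FC
```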

Identify the total length of the genes in the Glycolysis pathway.

## [1] 1500
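
A minimal sketch, again on a reduced version of the merged table with hypothetical names `Merged` and `glycolysisLength`: a logical test on pathway selects the matching lengths, and sum() adds them up.

```r
Merged <- data.frame(
  geneNames   = c("Gene_2", "Gene_5", "Gene_3", "Gene_1"),
  pathway     = c("TGFb", "Glycolysis", "Glycolysis", "Glycolysis"),
  geneLengths = c(3000, 1200, 200, 100),
  stringsAsFactors = FALSE
)

# Sum the lengths of the rows annotated as Glycolysis.
glycolysisLength <- sum(Merged$geneLengths[Merged$pathway == "Glycolysis"])
glycolysisLength
```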

Rockefeller University, Bioinformatics Resource Centre

https://rockefelleruniversity.github.io/Intro_To_R_1Day/