Environments


Set Up

All prerequisites, links to material and slides for this course can be found on GitHub.

They can also be downloaded as a zip archive from here.

Course materials

Once the zip file is unarchived, all presentations (as HTML slides and single pages), their R code and the HTML practical sheets will be available in the directories underneath.

  • presentations/slides/ Presentations as an HTML slide show.
  • presentations/singlepage/ Presentations as an HTML single page.
  • presentations/r_code/ R code in presentations.
  • exercises/ Practicals as HTML pages.
  • answers/ Practicals with answers as HTML pages and R code solutions.

Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You can navigate to the unarchived Reproducible_R folder using the RStudio menu.

Session -> Set Working Directory -> Choose Directory

or in the console.

setwd("/PathToMyDownload/Reproducible_R-master/r_course")
# e.g. setwd('~/Downloads/Reproducible_R-master/r_course')
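
A quick way to confirm the working directory is set correctly is to check that the course folders listed above are visible from it. This is only a minimal sketch: check_course_dirs() is a hypothetical helper written for illustration, not part of the course materials.

```r
# Hypothetical helper: report which expected course folders are missing
# from a given directory (defaults to the current working directory).
check_course_dirs <- function(root = getwd()) {
  expected <- c("presentations/slides", "presentations/singlepage",
                "presentations/r_code", "exercises", "answers")
  expected[!dir.exists(file.path(root, expected))]
}

# An empty result means every folder was found.
check_course_dirs()
```

If any folder names are returned, the working directory is probably not pointing at the unarchived course folder.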

Why does reproducibility matter?


Reproducible Bioinformatics

Good science is built on being reproducible. This does not end at the bench.

People often face reproducibility issues when trying to repeat someone else’s analysis (or even their own).

In theory it should be easy. Just take the raw data and reprocess it in the same way. But there are often gaps in documentation.

NIH and FAIR

New NIH funding applications now require a more extensive data management plan. Though this does not yet seem to be strongly enforced for code, the policy does say:

“Related Tools, Software and/or Code Indicate whether specialized tools are needed to access or manipulate shared scientific data to support replication or reuse, and name(s) of the needed tool(s) and software. If applicable, specify how needed tools can be accessed.”

This fits with the NIH ethos that data should follow the FAIR principles of being Findable, Accessible, Interoperable and Reusable. Over time we can expect this to become a more necessary part of published work.

Environments and package management

When people struggle to repeat work, the problem is often not the data but poor documentation of the workflow. By accurately recording which packages and versions were used, we circumvent a lot of these issues.

Reproducibility?


Base R

R contains many useful tools and functions which allow us to perform complex operations out of the box.

We get access to these when R loads without having to install or load any packages.

These include arithmetic operations, functions for reading/writing data and utility functions.

myMat <- matrix(1:100, ncol = 10, byrow = TRUE)
means <- colMeans(myMat)
write.csv(myMat, file = "Test.csv")

Base R

We get access to these functions through base packages loaded as R starts.

These base packages include

  • base
  • graphics
  • grDevices
  • utils
  • datasets
  • methods

Base R

We can get a comprehensive list of functions available in these packages using R help.

We can type ?base into R and get an index of all functions in the base package.

?base

R and Base R versions

If we review the help index we can see that the version of the base R package matches the version of R we are using.

base r version
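
The same check can be made from the console rather than the help index. A minimal sketch using getRversion() and packageVersion(), both of which ship with R:

```r
# Version of the running R installation.
r_version <- getRversion()

# Version recorded for the base package.
base_version <- packageVersion("base")

print(r_version)
print(base_version)

# TRUE when the two agree.
r_version == base_version
```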

Versioning

Versioning of software allows developers to make updates to their software while maintaining availability/trackability of previous versions of their software.

This also allows users to track the exact version of software they use to maintain essential reproducibility within their own work.

At the time of writing, the current version of the R software is 4.3.1.

Versioning in R packages

R package versions follow the convention of 3 numbers representing the major, minor and patch versions of the software.

PackageName major.minor.patch

e.g. at the time of writing the current version of ggplot2 is ggplot2 - 3.4.2

  • Major - A major version update usually involves changes that affect backwards compatibility with previous versions and substantial changes to an API
  • Minor - A minor version update typically occurs when adding new functionality to a package
  • Patch - A patch update will occur when bugs/issues are fixed within a package but no new functionality is added.
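
As a small illustration of the convention, base R's package_version() can split a version string into its three components and compare versions numerically. The version numbers below are just examples.

```r
# Parse a version string into its components.
v <- package_version("3.4.2")
c(major = v$major, minor = v$minor, patch = v$patchlevel)

# Versions are compared number by number, so 3.4.2 is older than 3.10.0.
as_versions <- package_version("3.4.2") < package_version("3.10.0")
print(as_versions)

# Plain strings are compared as text instead, which can give a
# different (and misleading) answer for the same pair.
as_strings <- "3.4.2" < "3.10.0"
print(as_strings)
```

This is why it is safer to compare versions as version objects rather than as character strings.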

Finding R package versions

To find current package versions of libraries we are using in the current R session as well as information on the version of R in use we can take advantage of functions within another base package - the utils package.

The utils package contains many functions for package installation and version management.

To get a look at all the available functions in utils we can use the R help again.

?utils

The sessionInfo function

One very useful function within the utils package is sessionInfo().

First we can run the sessionInfo() function with no arguments.

sessionInfo()

The sessionInfo function

  • R version information
  • Attached base packages
    • Attached packages are loaded and their functions are available to the user
  • Loaded via a namespace
    • These packages are used by attached packages but their functions are not directly available to the user

The sessionInfo function

If we load a few more packages into R we can see this reflected in the sessionInfo() output under the other attached packages.

library(ggplot2)
library(dplyr)
sessionInfo()

The sessionInfo function

We can also use the sessionInfo function to provide the version information on a specified package.

We simply need to pass the package name as an argument to the sessionInfo() call, and the specified package’s version is shown under the other attached packages section.

sessionInfo("ggplot2")
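
If only the version number is needed, packageVersion() from utils returns it directly for any installed package. The sketch below uses the stats package as a stand-in because it ships with R; the minimum version used in the comparison is an arbitrary example.

```r
# "stats" ships with R, so this runs on any installation; swap in
# "ggplot2" (or any other installed package) as needed.
pkg_version <- packageVersion("stats")
print(pkg_version)

# Versions can be compared against a minimum requirement.
pkg_version >= "3.0.0"
```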

The sessionInfo function

The sessionInfo function returns an object of class sessionInfo containing the R/package version information.

sess_info <- sessionInfo()
class(sess_info)
## [1] "sessionInfo"
names(sess_info)
##  [1] "R.version"   "platform"    "locale"      "tzone"       "tzcode_type"
##  [6] "running"     "RNGkind"     "basePkgs"    "otherPkgs"   "loadedOnly" 
## [11] "matprod"     "BLAS"        "LAPACK"      "LA_version"
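
Because the result is an ordinary list, individual pieces can be pulled out and saved alongside an analysis. A minimal sketch; the output file name and its location in a temporary directory are arbitrary choices for illustration.

```r
sess_info <- sessionInfo()

# R version string, e.g. for a methods section.
r_version <- sess_info$R.version$version.string

# Names of the attached base packages.
base_pkgs <- sess_info$basePkgs

# Versions of other attached packages (empty if none are attached).
other_versions <- vapply(sess_info$otherPkgs,
                         function(pkg) pkg$Version, character(1))

# Save the printed report so it can be kept with the analysis output.
report_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(print(sess_info)), report_file)
```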

Finding R package versions in RStudio

RStudio makes package version discovery easy using the Packages panel.

Here you can see -

  • Package Name
  • Package Description
  • Package Version
  • Link to package page
  • Button to delete package

Getting access to new packages

Although we have access to all the functions within the base packages by default, we will most likely want to take advantage of functions in the thousands of packages available in the many R package repositories.

Package repositories hold collections of packages, in differing versions, available for download or review.

The two most popular R package repositories are:

  • CRAN
  • Bioconductor

CRAN

CRAN, the Comprehensive R Archive Network, was the first and is the most popular R package repository.

It takes its name from CTAN, the Comprehensive TeX Archive Network, and inspiration from CPAN, the Comprehensive Perl Archive Network.

CRAN both provides R software download links and hosts versioned R packages.

CRAN package page

Each package on CRAN has its own package page containing useful information on the package.

CRAN package page

Of particular interest to version control and reproducibility are the Depends, Imports, LinkingTo and Suggests fields.

  • Depends - Packages listed here are attached when the package is loaded. Functions are available to the user.
  • Imports - Packages are loaded (but not attached) and available to the package. Functions are not directly available to the user.
  • LinkingTo - Packages are used for their C/C++ header files. Functions are not available to the user.
  • Suggests - Packages used in examples or vignettes but not required for package functionality.

Packages listed under Depends, Imports and LinkingTo are dependencies of this package and, if not already installed, are installed alongside this package.
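
The same fields can be read locally from the DESCRIPTION file of any installed package. A minimal sketch using the stats package, chosen only because it ships with every R installation:

```r
# Locate the installed DESCRIPTION file for a package that ships with R.
desc_file <- system.file("DESCRIPTION", package = "stats")

# Read only the dependency fields; fields that are absent come back as NA.
dep_fields <- read.dcf(desc_file,
                       fields = c("Depends", "Imports", "LinkingTo", "Suggests"))

# Transpose so each field is shown on its own row.
print(t(dep_fields))
```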

Installing CRAN packages

R provides another function from the utils base package which allows us to install a package from the CRAN repositories - install.packages().

At its simplest, to install a package into our R library we can just use the install.packages() function, supplying the name of the package we want to install.

This will install the package and all required dependencies (the packages listed in Depends, Imports and LinkingTo).

install.packages("redist")
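
In scripts it is common to install a package only when it is not already available, so that re-running the script does not reinstall it. A minimal sketch: install_if_missing() is a hypothetical helper, and the default repository URL (the CRAN cloud mirror) is an assumption that can be changed.

```r
# Hypothetical helper: install only the packages that are not yet available.
# Returns (invisibly) the names of the packages it had to install.
install_if_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs, repos = repos)
  }
  invisible(missing_pkgs)
}

# "stats" ships with R, so nothing is installed here.
installed_now <- install_if_missing("stats")
```

Note that this only checks whether a package is present, not which version is installed.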

SessionInfo again

Now that it is installed, we can load the library and check sessionInfo() to see where our Depends and Imports packages appear within the session information.

library(redist)
sessionInfo()

Setting the R package repositories

By default the first time you install an R package in a new version of R it will ask you which mirror you want to use.

A mirror is simply a copy of the package database in a distinct location.

We can induce R to ask us which mirror to use by using the chooseCRANmirror() function.

chooseCRANmirror()

Mirrors for install.packages()

The install.packages() function can take an additional parameter repos to tell it which repository mirror to use.

We can get a full list of CRAN mirrors from here

Once we have chosen a mirror (typically based on closest location) we can supply the URL to the repos argument of install.packages(). Here we will use the University of Michigan mirror.

install.packages("redist", repos = "https://repo.miserver.it.umich.edu/cran/")
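
Rather than passing repos on every call, the mirror can also be set once for the session through the repos option, which install.packages() uses by default. A minimal sketch; the CRAN cloud mirror URL is used here only as an example.

```r
# Remember the current setting so it can be restored later.
old_repos <- getOption("repos")

# Point CRAN at a chosen mirror for the rest of the session.
options(repos = c(CRAN = "https://cloud.r-project.org"))
getOption("repos")

# To undo: options(repos = old_repos)
```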

Bioconductor

Another R package repository we will most likely want to make use of is Bioconductor.

Bioconductor focuses on packages related to biological data (analysis, annotation, datasets) and follows many of the conventions set out in CRAN.

Bioconductor Package page

Bioconductor package pages contain similar information to CRAN pages, including the Depends, Imports and Suggests fields.

Bioconductor Versions

In contrast to CRAN, Bioconductor packages are released on a 6-month schedule.

These releases are tied to the yearly release of new R versions.

With every 6-month release we get a new Bioconductor version, and all packages in Bioconductor get their minor versions incremented by default.

This means that to make use of new packages you may need to update the version of Bioconductor you are using.

Installing Bioconductor packages

Every Bioconductor package page has installation instructions which you can simply copy and paste into the R console.

To do this, Bioconductor members have created the BiocManager package, which is itself on CRAN. This allows it to be installed using base R functions and then used to install the appropriate Bioconductor version of a package.

BiocManager package

The BiocManager package can be used to install packages from CRAN and Bioconductor.

install.packages("BiocManager")
library(BiocManager)
install("DESeq2")

BiocManager package

It also contains useful functions to manage Bioconductor versions.

Here we check the version of Bioconductor we are using by way of the version() function.

version()
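
Base R also has an object called version, so in scripts it can be clearer to call the BiocManager function with its namespace prefix. The sketch below is defensive by assumption: it returns NA when BiocManager is not installed, or when the lookup fails (for example without internet access).

```r
# Query the Bioconductor version only if BiocManager is installed;
# otherwise record NA. Any error during the lookup is also recorded as NA.
bioc_version <- if (requireNamespace("BiocManager", quietly = TRUE)) {
  tryCatch(as.character(BiocManager::version()),
           error = function(e) NA_character_)
} else {
  NA_character_
}
bioc_version
```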

BiocManager package update

If we want to update all packages to a specific version of Bioconductor we can set the version argument in the install() function and not specify a package.

NOTE: Don’t do this unless absolutely sure.

install(version = "3.15")

Managing package versions

At this point we can install packages from CRAN and/or Bioconductor and we can use the sessionInfo() function to list all the package versions we have been using.

So where is the reproducibility problem?

Let’s take a look at a very simple example.

Managing package versions

Let’s load 2 packages which are very common in bioinformatics - ggplot2 and DESeq2.

library(ggplot2)
library(DESeq2)
sessionInfo()

Managing package versions

By simply loading the two libraries I wanted to use, I am now managing the dependencies of these packages as well. This adds up to ~70 packages in total.

I need to be able not just to capture these versions but to rebuild all these packages with the same versions in a new R installation, or more likely to let someone else rebuild them.
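
The package count for any session can be checked from the sessionInfo() object itself. A minimal sketch; the numbers will differ depending on which packages are loaded when it is run.

```r
si <- sessionInfo()

# Count packages in each section of the session information.
pkg_counts <- c(base        = length(si$basePkgs),
                attached    = length(si$otherPkgs),
                loaded_only = length(si$loadedOnly))
pkg_counts

# Total number of packages whose versions matter for this session.
sum(pkg_counts)
```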

Installing package versions

To install a particular package version we can point the install.packages() function at an archived source file linked from the package’s CRAN/Bioconductor page.

Additionally I need to supply the parameters repos = NULL and type = "source".

install.packages("https://cran.r-project.org/src/contrib/Archive/unmarked/unmarked_0.8-1.tar.gz",
    repos = NULL, type = "source")
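
The archive URL above follows a regular pattern, so it can be built from a package name and version. A minimal sketch: cran_archive_url() is a hypothetical helper, and it assumes the requested version has been superseded and therefore sits in the CRAN Archive directory.

```r
# Hypothetical helper: build the CRAN Archive URL for an older package version.
cran_archive_url <- function(pkg, version) {
  sprintf("https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
          pkg, pkg, version)
}

archive_url <- cran_archive_url("unmarked", "0.8-1")
print(archive_url)

# The install itself is left commented out so the sketch has no side effects.
# install.packages(archive_url, repos = NULL, type = "source")
```

With repos = NULL, R does not resolve dependencies automatically, so any packages the archived version needs must already be installed.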

Renv

Renv is a package to create and manage reproducible environments from within R.

Written by Posit/RStudio, Renv replaces much of the functionality of previous reproducibility packages such as packrat and is easily integrated into project analysis workflows.

Its main use is to capture the R and R package versions used within a project and provide functionality to rapidly rebuild these environments.

New projects with Renv

The best place to start to use Renv is with a new project. We first create a new directory/project for us to work in.

RStudio in fact offers to set up new projects with renv, but we will skip this for now in order to set this up ourselves.

Next we can install the renv package from CRAN.

install.packages("renv")

Initialise a project

With a new project directory, we can see that we have nothing in the environment and no scripts.

Let’s now initialise renv by using the init() function.

We can see that renv will discover project dependencies and then copy any required packages from your main R libraries to a project specific library cache.

Following this it creates an renv.lock file in the present directory containing R and R package version information.

library(renv)
init()

Project directory

If we now review the project directory we can see renv has included a few essential files and directories

  • .Rprofile - This file is read at project start up and sources the activate script within the renv/ directory
  • renv.lock - This is the file containing all the information on R version and R package versions.
  • renv/ - This directory contains the local copy/cache of project libraries as well as the activate.R script referenced in the .Rprofile and required to set-up renv.

The renv.lock file

The renv.lock file contains the most important information for reproducibility in JSON format.

Within the lock file structure we have two sections at the moment.

  1. R version information
    • R version
    • Repositories used and URL for the repository
  2. Package version information
    • package name
    • package version
    • package source
    • package repository
    • package hash

Versions and Hashes.

Renv uses a hash of the DESCRIPTION file within a package to assess whether a package has been updated.

A hash simply represents a piece of data as a fixed-length string, such that if the file/directory is changed a different hash will represent it.

In a well maintained package, the hash should only change with a package version change, but this is not always the case, so a hash is a more robust way to check that a package is the same as expected.
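
The behaviour of a hash can be seen with tools::md5sum() from base R. This is only an illustration of the idea, not renv's exact procedure, and the file contents are made up for the example.

```r
# Three small files: two identical, one differing in the version line.
file_a <- tempfile()
file_b <- tempfile()
file_c <- tempfile()
writeLines(c("Package: demo", "Version: 1.0.0"), file_a)
writeLines(c("Package: demo", "Version: 1.0.0"), file_b)  # identical content
writeLines(c("Package: demo", "Version: 1.0.1"), file_c)  # one character differs

# md5sum() returns one fixed-length hash per file.
hashes <- unname(tools::md5sum(c(file_a, file_b, file_c)))
print(hashes)

hashes[1] == hashes[2]  # same content, same hash
hashes[1] == hashes[3]  # changed content, different hash
```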

Adding packages

Now that we have started our project, we can add to this directory a simple ggplot2 plotting script we are using within our analysis.

Assess required packages

Once a new script is within the directory we can first check whether the script uses any new packages and whether we need to update our renv environment.

We can do this by using the renv status() function.

status()
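
To give a feel for what dependency discovery involves, the sketch below scans a script for library() and require() calls using base R only. It is a simplified illustration, not renv's implementation (renv provides its own dependencies() function for this), and the example script content is made up.

```r
# Write a tiny example script to a temporary file (hypothetical content).
script <- tempfile(fileext = ".R")
writeLines(c("library(ggplot2)", "require(dplyr)", "x <- 1"), script)

# Find library()/require() calls and pull out the package names.
code <- readLines(script)
pattern <- "^\\s*(library|require)\\(([A-Za-z0-9.]+)\\).*$"
found_pkgs <- sub(pattern, "\\2", code[grepl(pattern, code)])
print(found_pkgs)
```

A real scanner also has to handle namespace-qualified calls such as pkg::fun() and other file types, which is why it is better to rely on renv for this.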

Record required packages

Now that we have assessed which packages are in use within the project but not captured in the lock file, we can run the renv snapshot() function to update our lock file.

snapshot()

Updated lock file

The lock file now contains all the packages required within the project.

We can see we have many more packages than just ggplot2 which was the only package called in the script.

All the remaining packages are listed under Depends or Imports for ggplot2 (directly or through its dependencies), so their versions were recorded along with ggplot2.

Sharing reproducible environments.

Once we have our final script as well as our renv initialised and up to date we most likely want to share with others so they can repeat the work.

The key components to share with others are:

  • The script/s used in analysis
  • renv.lock
  • renv/settings.json
  • .Rprofile
  • renv/activate.R

With this set-up your project should be reproducible by others.
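
Before sharing, it can be worth checking that the renv files listed above are actually present in the project. A minimal sketch: missing_renv_files() is a hypothetical helper, and it does not check for the analysis scripts themselves.

```r
# Hypothetical helper: list the renv files from the checklist that are
# missing from a project directory (defaults to the current directory).
missing_renv_files <- function(project = ".") {
  required <- c("renv.lock", ".Rprofile",
                "renv/activate.R", "renv/settings.json")
  required[!file.exists(file.path(project, required))]
}

# An empty result means all four files are in place.
missing_renv_files()
```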

Reproducing from Renv

Now we have sent these files to another user (or shared on GitHub) we can initialise the project on the new computer/system by entering the project directory containing the required renv files and running renv’s restore function.

This will update the project to have all the required packages installed and available to this project.

restore()
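
When restore is run from a script rather than typed in the console, a couple of checks up front can give clearer error messages. A minimal sketch: restore_project() is a hypothetical wrapper, and setting prompt = FALSE (so that renv does not ask for confirmation) is a choice made for non-interactive use.

```r
# Hypothetical wrapper: check the prerequisites, then restore the project
# library from the lock file.
restore_project <- function(project = ".") {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("renv is not installed; run install.packages(\"renv\") first")
  }
  lockfile <- file.path(project, "renv.lock")
  if (!file.exists(lockfile)) {
    stop("No renv.lock found in ", project)
  }
  renv::restore(project = project, prompt = FALSE)
}

# Usage, from inside the shared project directory:
# restore_project()
```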

Further Resources

Contact

For any suggestions, comments, edits or questions (about content or the slides themselves) please raise an issue on our GitHub.