All prerequisites, links to material and slides for this course can be found on GitHub.
They can also be downloaded as a zip archive from here.
Once the zip file is unarchived, all presentations (as HTML slides and pages), their R code and the HTML practical sheets will be available in the directories underneath.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived Reproducible_R folder using the RStudio menu.
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/PathToMyDownload/Reproducible_R-master/r_course")
# e.g. setwd('~/Downloads/Reproducible_R-master/r_course')
Good science is built on being reproducible. This does not end at the bench.
People often face reproducibility issues when trying to repeat someone else’s analysis (or even their own).
In theory it should be easy. Just take the raw data and reprocess it in the same way. But there are often gaps in documentation.
New funding applications now require a more extensive data management plan. Though this doesn’t seem to have strong teeth for code, the policy does say:
“Related Tools, Software and/or Code: Indicate whether specialized tools are needed to access or manipulate shared scientific data to support replication or reuse, and name(s) of the needed tool(s) and software. If applicable, specify how needed tools can be accessed.”
This fits with their ethos of data following the FAIR principles of being Findable, Accessible, Interoperable and Reusable. Over time we can see this becoming a more necessary part of published work.
Often where people struggle to repeat work, the problem is not the data but poor documentation of the workflow. By accurately recording which packages and versions were used we circumvent a lot of these issues.
R contains many useful tools and functions which allow us to perform complex operations out of the box.
We get access to these when R loads without having to install or load any packages.
These include arithmetic operations, functions for reading/writing data and utility functions.
myMat <- matrix(1:100, ncol = 10, byrow = TRUE)
means <- colMeans(myMat)
write.csv(myMat, file = "Test.csv")
We get access to these functions through base packages loaded as R starts.
These base packages include
We can get a comprehensive list of functions available in these packages using R help.
We can type ?base into R and get an index of all functions in the base package.
?base
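As a complement to the help index, the functions and base packages can also be listed programmatically. The following is a minimal sketch assuming a standard R installation; the variable name base_functions is illustrative only.

```r
# Names of all objects exported by the base package
base_functions <- ls("package:base")
length(base_functions)
head(base_functions)

# Packages attached by default when R starts, in addition to base itself
getOption("defaultPackages")

# All packages shipped with R that have "base" priority
rownames(installed.packages(priority = "base"))
```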
If we review the help index we can see that the version of the base R package matches the version of R we are using.
Versioning of software allows developers to make updates to their software while maintaining availability/trackability of previous versions of their software.
This also allows users to track the exact version of software they use to maintain essential reproducibility within their own work.
The current version of the R software is 4.3.1
R package versions follow the convention of 3 numbers representing the major, minor and patch versions of the software.
PackageName major.minor.patch
e.g. the current version of ggplot2 is 3.4.2
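The major/minor/patch convention can be explored directly in R. The sketch below uses the base function package_version() to split a version string into its components and to compare versions; the version strings are examples only.

```r
# Parse a version string into a comparable version object
v <- package_version("3.4.2")
v$major       # 3
v$minor       # 4
v$patchlevel  # 2

# Comparisons are made numerically per component, not alphabetically
package_version("3.10.0") > package_version("3.9.0")  # TRUE

# Version of R itself and of an installed package
getRversion()
packageVersion("utils")
```

Comparing version objects rather than raw strings avoids mistakes such as treating 3.10.0 as older than 3.9.0.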
To find current package versions of libraries we are using in the current R session as well as information on the version of R in use we can take advantage of functions within another base package - the utils package.
The utils package contains many functions on package installation and version management.
To see all the functions available in utils we can use the R help again.
?utils
One very useful function within the utils package is sessionInfo().
First we can run the sessionInfo() function with no arguments.
sessionInfo()
If we load a few more packages into R we can see this reflected in the sessionInfo() output under the other attached packages.
library(ggplot2)
library(dplyr)
sessionInfo()
We can also use the sessionInfo function to provide the version information on a specified package.
We simply pass the package name as an argument to sessionInfo() and the specified package’s version is shown under the other attached packages section.
sessionInfo("ggplot2")
The sessionInfo function returns an object of class sessionInfo containing the R/package version information.
sess_info <- sessionInfo()
class(sess_info)
## [1] "sessionInfo"
names(sess_info)
## [1] "R.version" "platform" "locale" "tzone" "tzcode_type"
## [6] "running" "RNGkind" "basePkgs" "otherPkgs" "loadedOnly"
## [11] "matprod" "BLAS" "LAPACK" "LA_version"
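Because the result is a list, individual pieces of version information can be pulled out and saved. The following is a rough sketch; the output file name and location are hypothetical, and a temporary directory is used here only so the example leaves no files behind.

```r
sess_info <- sessionInfo()

# R version string recorded for this session
sess_info$R.version$version.string

# Versions of the attached base packages
base_versions <- vapply(
  sess_info$basePkgs,
  function(pkg) as.character(packageVersion(pkg)),
  character(1)
)
base_versions

# Versions of any other attached packages (empty if none are attached)
other_versions <- vapply(
  sess_info$otherPkgs,
  function(desc) desc$Version,
  character(1)
)
other_versions

# Save the printed session information alongside the analysis results
out_file <- file.path(tempdir(), "sessionInfo.txt")  # hypothetical location
writeLines(capture.output(print(sess_info)), out_file)
```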
RStudio makes package version discovery easy using the Packages panel, which lists the installed packages alongside their versions.
Although we have access to all the functions within the base packages by default, we will most likely want to take advantage of functions in the thousands of packages available in the many R package repositories.
Package repositories hold collections of packages under differing versions available for download or review.
The two most popular R package repositories are CRAN and Bioconductor.
CRAN, the Comprehensive R Archive Network, was the first and is the most popular R package repository.
It takes its name from CTAN, the Comprehensive TeX Archive Network, and inspiration from CPAN, the Comprehensive Perl Archive Network.
CRAN provides R software download links and hosts versioned R packages.
Each package on CRAN has its own package page containing useful information on the package.
Of particular interest to version control and reproducibility are the Depends, Imports and Suggests fields.
Packages listed under Depends, Imports and LinkingTo are dependencies of this package and, if not already installed, are installed alongside this package.
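The same fields can be read for any installed package from within R. The sketch below uses packageDescription() from utils on the stats package, chosen only because it ships with R; split_field() is a hypothetical helper written for this example.

```r
# Read the dependency fields from an installed package's DESCRIPTION file.
# Fields that a package does not declare come back as NA.
deps <- packageDescription("stats", fields = c("Depends", "Imports", "Suggests"))
deps

# Hypothetical helper: split one field into individual package names
split_field <- function(field) {
  if (is.null(field) || is.na(field)) return(character(0))
  entries <- strsplit(field, ",")[[1]]
  # Drop version constraints such as "(>= 3.5.0)" and trim whitespace
  trimws(sub("\\(.*\\)", "", entries))
}

split_field(deps$Imports)
```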
R provides another function from the utils base package which allows us to install a package from the CRAN repositories - install.packages().
In its simplest form, to install a package into our R library we can just call install.packages() with the name of the package we want to install.
This will install the package and all required dependencies (the packages listed in DEPENDS and IMPORTS).
install.packages("redist")
Now it is installed we can load the library and check the sessionInfo to see where our DEPENDS and IMPORTS packages are within the session information.
library(redist)
sessionInfo()
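In scripts it is often convenient to install a package only when it is not already available. The following is one possible sketch; install_if_missing() is a hypothetical helper, not part of utils.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  # Return the names of the packages that had to be installed
  invisible(missing_pkgs)
}

# "stats" ships with R, so nothing is installed here
install_if_missing("stats")
```

Calling install_if_missing("redist") would behave like the install.packages() call above the first time and do nothing on later runs.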
By default the first time you install an R package in a new version of R it will ask you which mirror you want to use.
A mirror is simply a copy of the package database hosted in a distinct location.
We can induce R to ask us which mirror to use by using the chooseCRANmirror() function.
chooseCRANmirror()
The install.packages() function can take an additional parameter repos to tell it which repository mirror to use.
We can get a full list of CRAN mirrors from here.
Once we have chosen a mirror (typically based on closest location) we can supply the URL to the repos argument of install.packages(). Here we will use the University of Michigan mirror.
install.packages("redist", repos = "https://repo.miserver.it.umich.edu/cran/")
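The list of mirrors is also available from within R, and a mirror can be set once per session rather than in every call. The sketch below assumes the mirror table shipped with the R installation is sufficient, so it does not need network access.

```r
# List the CRAN mirrors known to this R installation.
# local.only = TRUE reads the copy shipped with R instead of downloading it.
mirrors <- getCRANmirrors(local.only = TRUE)
head(mirrors[, c("Name", "Country", "URL")])

# Set a default repository for the rest of the session so that
# install.packages() no longer needs a repos argument
options(repos = c(CRAN = "https://repo.miserver.it.umich.edu/cran/"))
getOption("repos")
```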
Another R package repository we will most likely want to make use of is Bioconductor.
Bioconductor focuses on packages related to biological data (analysis, annotation, datasets) and follows many of the conventions set out in CRAN.
Bioconductor package pages contain similar information to CRAN pages, including the DEPENDS, IMPORTS and SUGGESTS fields.
In contrast to CRAN, Bioconductor packages are released on a fixed schedule, once every 6 months.
These releases are tied to the yearly release of new R versions.
With every 6-monthly release we get a new Bioconductor version and all packages in Bioconductor get their minor versions incremented by default.
This means that to make use of new packages you may need to update the version of Bioconductor you are using.
Every Bioconductor package has instructions on how to install it, which you can simply copy and paste into the R console.
To do this, Bioconductor members have created the BiocManager package, which is itself on CRAN. This allows it to be installed using base R functions and then used to install the appropriate Bioconductor version of a package.
The BiocManager package can be used to install packages from both CRAN and Bioconductor.
install.packages("BiocManager")
library(BiocManager)
install("DESeq2")
It also contains useful functions to manage Bioconductor versions.
Here we check the version of Bioconductor we are using by way of the version() function.
version()
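Note that base R also has an object called version, so calling these functions with an explicit BiocManager:: prefix can make scripts easier to read. A minimal sketch, which does nothing if BiocManager is not installed:

```r
# Check whether BiocManager is available before using it
has_biocmanager <- requireNamespace("BiocManager", quietly = TRUE)

if (has_biocmanager) {
  # Bioconductor version in use
  print(BiocManager::version())

  # Namespace-qualified equivalent of install("DESeq2"); left commented
  # out here to avoid triggering a long installation
  # BiocManager::install("DESeq2")
}
```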
If we want to update all packages to a specific version of Bioconductor we can set the version argument in the install() function without naming a specific package.
NOTE: Don’t do this unless absolutely sure.
install(version = "3.15")
At this point we can install packages from CRAN and/or Bioconductor and we can use the sessionInfo() function to list all the package versions we have been using.
So where is the reproducibility problem?
Let’s take a look at a very simple example.
Let’s load 2 packages which are very common in Bioinformatics - ggplot2 and DESeq2.
library(ggplot2)
library(DESeq2)
sessionInfo()
By simply loading the two libraries I wanted to use, I am now managing the dependencies of these packages as well. This adds up to ~70 packages in total.
Capturing versions is not enough: I need to be able to rebuild all these packages at the same versions in a new R installation or, more likely, allow someone else to rebuild them.
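The size of this dependency footprint can be counted from the session information itself. The sketch below works in any session; the totals depend on which packages happen to be loaded, so no particular numbers are implied.

```r
sess_info <- sessionInfo()

# Number of packages in each part of the session information
pkg_counts <- c(
  base        = length(sess_info$basePkgs),   # attached base packages
  other       = length(sess_info$otherPkgs),  # other attached packages
  loaded_only = length(sess_info$loadedOnly)  # loaded via a namespace, not attached
)
pkg_counts
sum(pkg_counts)
```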
To install a particular package version we can point the install.packages function at an archive listed on a package’s CRAN/Bioconductor page.
Additionally we need to supply the parameters repos = NULL and type = "source".
install.packages("https://cran.r-project.org/src/contrib/Archive/unmarked/unmarked_0.8-1.tar.gz",
repos = NULL, type = "source")
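When several archived versions are needed, the archive URL can be assembled from the package name and version. This is only a sketch: cran_archive_url() is a hypothetical helper that follows the URL pattern shown above and does not check that the archive actually exists.

```r
# Hypothetical helper that builds a CRAN archive URL from a package name
# and version, following the URL pattern used above
cran_archive_url <- function(pkg, version) {
  sprintf(
    "https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
    pkg, pkg, version
  )
}

archive_url <- cran_archive_url("unmarked", "0.8-1")
archive_url

# Installation itself is left commented out because it needs network
# access and the build tools required for source packages
# install.packages(archive_url, repos = NULL, type = "source")
```

Repeating this by hand for ~70 packages quickly becomes impractical, which is the gap a dedicated environment manager is meant to fill.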
renv is a package to create and manage reproducible environments from within R.
Written by Posit/RStudio, renv replaces much of the functionality of previous reproducibility packages such as packrat and is easily integrated into project analysis workflows.
Its main use is to capture the R and R package versions used within a project and to provide functionality to rapidly rebuild these environments.
The best place to start using renv is with a new project. We first create a new directory/project for us to work in.
RStudio in fact offers to set up new projects with renv, but we will skip this for now in order to set this up ourselves.
Next we can install the renv package from CRAN.
install.packages("renv")
With a new project directory, we can see that we have nothing in the environment and no scripts.
Let’s now initialise renv by using the init() function.
We can see that renv will discover project dependencies and then copy any required packages from your main R libraries to a project-specific library cache.
Following this it creates an renv.lock file in the present directory containing R and R package version information.
library(renv)
init()
If we now review the project directory we can see renv has included a few essential files and directories
[Figure: top-level directory; renv directory]
The renv.lock file contains the most important information for reproducibility in JSON format.
Within the lock file structure we have two sections at the moment.
[Figure: renv.lock]
renv uses a hash of the DESCRIPTION file within a package to assess whether a package has been updated.
A hash simply represents a piece of data as a fixed-length string, such that if the file/directory is changed a different hash would now represent it.
In a well-maintained package the hash should only change with a package version change, but this is not always the case, so a hash is a more robust way to check that a package is the same as expected.
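The behaviour of a hash can be seen with a small file and the md5sum() function from the tools package. MD5 is used here purely for illustration; it is an assumption of this sketch, not a statement about the exact hashing scheme renv uses internally.

```r
# Write a small file and compute its MD5 hash
demo_file <- tempfile(fileext = ".txt")
writeLines("Version: 1.0.0", demo_file)
hash_before <- unname(tools::md5sum(demo_file))

# Change the content and hash the file again
writeLines("Version: 1.0.1", demo_file)
hash_after <- unname(tools::md5sum(demo_file))

nchar(hash_before)                  # 32 characters, whatever the file size
identical(hash_before, hash_after)  # FALSE: the content changed
```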
Now that we have started our project, we can add to this directory a simple ggplot2 plotting script that we are using within our analysis.
Once a new script is within the directory we can first check whether the script contains any new packages and whether we need to update our renv environment.
We can do this by using the renv status() function.
status()
Now that we have identified the packages in use within the project but not captured in the lock file, we can run the renv snapshot() function to update our lock file.
snapshot()
The lock file now contains all the packages required within the project.
We can see we have many more packages than just ggplot2 which was the only package called in the script.
All these remaining packages were DEPENDS or IMPORTS for the ggplot2 package and so have had versions recorded as part of the ggplot2 package install.
Once we have sent these files to another user (or shared them on GitHub), the project can be initialised on the new computer/system by entering the project directory containing the required renv files and running renv’s restore function.
This will update the project to have all the required packages installed and available to this project.
restore()
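In a shared script it can help to guard the restore step so that it only runs when a lock file is present. The following is a rough sketch: restore_if_locked() is a hypothetical helper, and it assumes renv is already installed on the new system.

```r
# Hypothetical helper: restore a project's packages only when renv is
# installed and a lock file is present in the project directory
restore_if_locked <- function(project = ".") {
  lock_file <- file.path(project, "renv.lock")
  if (!file.exists(lock_file) || !requireNamespace("renv", quietly = TRUE)) {
    return(FALSE)
  }
  # prompt = FALSE skips the interactive confirmation
  renv::restore(project = project, prompt = FALSE)
  TRUE
}

# A fresh empty directory has no lock file, so nothing is restored
empty_dir <- tempfile("no_lockfile_")
dir.create(empty_dir)
restore_if_locked(empty_dir)  # FALSE
```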
For any suggestions, comments, edits or questions (about content or the slides themselves), please raise an issue on our GitHub repository.