Bioconductor (BioC) is an open-source, community-driven software project that provides a framework of tools and databases for the analysis of biological data in R.
The broad goals of the Bioconductor project are:
Bioconductor packages can be broadly split into 4 groups.
Current release is Bioconductor 3.13
Includes nearly 2000 software packages.
All packages have been
- Reviewed.
- Tested and evaluated automatically.
- Actively maintained and updated.
Before being accepted into Bioconductor, all new packages are reviewed to ensure they meet Bioconductor guidelines.
Review includes automatic testing of packages as well as an open review on the Bioconductor GitHub site.
Review ensures
Installing Bioconductor and Bioconductor packages is quite straightforward. Every Bioconductor package comes with a description of its R installation command, which we can simply copy and paste.
Install Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.13")
Install a package:
BiocManager::install("basecallQC")
All dependencies and their required versions are resolved for us. We must be careful, however, to check the version of Bioconductor we are using.
BiocManager::version()
## [1] '3.13'
If we wish to update to the latest Bioconductor release, we can install Bioconductor again:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.13")
All packages will have a reference manual containing the help pages for every function.
This will include importantly
All packages will also include at least one vignette.
These vignettes detail a typical usage of the package with working examples included.
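As a minimal sketch of how to reach this documentation from an R session, assuming the Biostrings package is already installed (it is used here only as an illustration; any installed package name could be substituted):

```r
# List the vignettes shipped with an installed package.
# "Biostrings" is an illustrative choice; any installed package works.
vigs <- vignette(package = "Biostrings")
vigs$results[, c("Item", "Title"), drop = FALSE]

if (interactive()) {
  # Open the package index, which links to the help page of every function.
  help(package = "Biostrings")
  # Open a browser page listing the package's vignettes.
  browseVignettes("Biostrings")
  # Open the help page for a single function.
  help("readDNAStringSet", package = "Biostrings")
}
```

The calls that open a help window or browser are wrapped in `interactive()` so the snippet also runs cleanly in a script.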
Bioconductor packages cover a wide range of biological data types.
In this course we are focusing on high-throughput sequencing, so we will concentrate on the main packages for this.
This includes methods for handling common genomics data types.
Genomic sequences stored as FASTA files are handled using the Biostrings package.
Genomic intervals stored as BED files are handled using the rtracklayer and GenomicRanges packages.
Genomic scores stored as wig or bigWig files are handled using the rtracklayer and GenomicRanges packages.
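A small sketch of reading these file types, assuming Biostrings and rtracklayer are installed. The file contents below are toy data written to temporary files purely for illustration; in practice we would point the same functions at our own FASTA or BED files.

```r
library(Biostrings)
library(rtracklayer)

# Write a tiny FASTA file to a temporary location (toy data for illustration).
fasta_file <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTACGT", ">seq2", "GGGCCC"), fasta_file)

# Read the sequences into a DNAStringSet.
seqs <- readDNAStringSet(fasta_file)
width(seqs)

# Write a tiny BED file (tab-separated: chromosome, start, end).
bed_file <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200", "chr2\t500\t650"), bed_file)

# Import the intervals as a GRanges object.
# BED starts are 0-based, so rtracklayer shifts them to 1-based coordinates.
peaks <- import(bed_file, format = "BED")
peaks
```

bigWig files can be read with the same `import()` function, which returns a GRanges object carrying the scores in a metadata column.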
As well as software packages, Bioconductor maintains a number of annotation packages.
This includes microarray annotation, gene to ID mappings, functional annotation of genes, genome sequence information and gene/transcript models.
Gene annotation for model organisms is contained within the org.db packages.
Format is org.<species>.<ID type>.db
Homo sapiens annotation with Entrez Gene IDs – org.Hs.eg.db
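As a hedged sketch of querying such a package, assuming org.Hs.eg.db is installed. The gene symbol TP53 is just an example key chosen for illustration.

```r
library(org.Hs.eg.db)

# List the kinds of identifiers and annotation the package can return.
columns(org.Hs.eg.db)

# Map a gene symbol to its Entrez Gene ID and full gene name.
# "TP53" is an example key chosen for illustration.
tp53 <- AnnotationDbi::select(org.Hs.eg.db,
                              keys = "TP53",
                              keytype = "SYMBOL",
                              columns = c("ENTREZID", "GENENAME"))
tp53
```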
Genomic sequence information is held within the BSgenome packages.
Format is BSgenome.<species>.<source>.<major version>
Homo sapiens genome sequence from UCSC’s version hg19 – BSgenome.Hsapiens.UCSC.hg19
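A minimal sketch of retrieving sequence from a BSgenome package, assuming BSgenome.Hsapiens.UCSC.hg19 is installed. The region on chr1 is an arbitrary example, not one of biological significance.

```r
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicRanges)

# The package exposes the genome as an object of the same name.
genome <- BSgenome.Hsapiens.UCSC.hg19

# Inspect the first few chromosome names.
head(seqnames(genome))

# Extract the sequence for an arbitrary 20 bp example region on chr1.
region <- GRanges("chr1", IRanges(start = 1000001, end = 1000020))
region_seq <- getSeq(genome, region)
region_seq
```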
Gene models are held in the TxDb packages.
Format is TxDb.<species>.<source>.<major version>.<table>
Homo sapiens gene build from UCSC’s version hg19 known gene table – TxDb.Hsapiens.UCSC.hg19.knownGene
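As a sketch of how gene models can be pulled out of a TxDb package, assuming TxDb.Hsapiens.UCSC.hg19.knownGene is installed. The variable names are illustrative.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# The package exposes the gene models as an object of the same name.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Gene and transcript coordinates are returned as GRanges objects.
hg19_genes <- genes(txdb)
hg19_transcripts <- transcripts(txdb)

hg19_genes
hg19_transcripts
```

Because the results are GRanges objects, they can be combined directly with intervals imported from BED files.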