About Bioconductor

Bioconductor (BioC) is an open-source, community-driven software project that provides a framework of tools and databases for the analysis of biological data in R.

Started in 2001 with Robert Gentleman.
Gained popularity through microarray analysis packages (marray/limma).
Core of reviewed and versioned R packages.
Two major releases every year (R has one major release).
Great support for high-throughput sequencing.

Bioconductor Goals

The broad goals of the Bioconductor project are:

To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
To facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. literature data from PubMed and annotation data from Entrez Gene.
To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
To further scientific understanding by producing high-quality documentation and reproducible research.
To train researchers on computational and statistical methods for the analysis of genomic data.

Bioconductor website

The project website is https://www.bioconductor.org/

Bioconductor packages

Bioconductor packages can be broadly split into four groups.

Software (Tools for the analysis/visualisation of biological data)
Annotation Data (Tools for the integration of biological metadata)
Experiment Data (Example and actual data from experiments)
Workflow (Packages documenting complete analysis workflows)

Current Bioconductor Release

Current release is Bioconductor 3.13

Includes nearly 2000 software packages.

All packages have been:

Reviewed.
Tested and evaluated automatically.
Actively maintained and updated.

Packages review

Before being accepted to Bioconductor, all new packages are reviewed to confirm that they meet the Bioconductor guidelines.

The review includes automatic testing of packages:

Testing the manual and all examples.
Checking code integrity.

It also includes an open review on the Bioconductor GitHub site.

Bioconductor packages

The review ensures that:

All packages can be built in the latest version of R.
Examples in the reference manual can be evaluated.
All code in the vignettes can be run.


Bioconductor packages

Example package: basecallQC


Installing Bioconductor and packages

Installing Bioconductor and Bioconductor packages is straightforward. Every Bioconductor package page shows the R command needed to install it, which we can simply copy and paste.

Install Bioconductor:

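# Install the BiocManager helper package from CRAN if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install the core packages for the Bioconductor 3.13 release
BiocManager::install(version = "3.13")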

Install a package:

BiocManager::install("basecallQC")

Bioconductor package dependencies

All dependencies and their required versions are resolved for us. We must be careful, however, to check which version of Bioconductor we are using.

BiocManager::version()

## [1] '3.13'

If we wish to update to the latest Bioconductor release, we can install Bioconductor again, naming the release we want:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.13")
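As a minimal sketch of the version check described above, the following reports the configured release and then asks BiocManager whether the installed packages are consistent with it. It assumes BiocManager is already installed and that an internet connection is available; the variable names are illustrative only.

```r
# Assumes BiocManager is already installed.
library(BiocManager)

# The Bioconductor release this R library is configured to use.
bioc_version <- BiocManager::version()
print(bioc_version)

# valid() returns TRUE when all installed packages match the release;
# otherwise it returns an object listing out-of-date or too-new packages.
status <- BiocManager::valid()
if (isTRUE(status)) {
  message("All installed packages match Bioconductor ",
          as.character(bioc_version))
} else {
  print(status)
}
```

If packages are reported as out of date, calling BiocManager::install() with no arguments offers to update them.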

Reference Manual

All packages will have a reference manual containing the help pages for every function.

Importantly, this will include:

Details of functions’ inputs and outputs.
Working examples.
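The same help pages can be reached from within R. The sketch below uses the stats package, which ships with R, purely as a stand-in; the name of any installed Bioconductor package and one of its functions can be substituted.

```r
# List every help page in a package (the index of its reference manual).
# "stats" is used only because it is always installed.
help(package = "stats")

# Open the help page for a single function, with its inputs and outputs.
help("median", package = "stats")

# Run the working examples from that help page.
example("median", package = "stats")
```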


Vignette

All packages will also include at least one vignette.

These vignettes detail a typical usage of the package with working examples included.
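Vignettes can also be listed and opened from R. This is a small sketch using the grid package, which is bundled with R, as a stand-in for a Bioconductor package; it assumes the R installation includes the package documentation.

```r
# List the vignettes that an installed package provides.
# "grid" is a stand-in; use the name of any installed Bioconductor package.
vig <- vignette(package = "grid")
print(vig$results[, c("Item", "Title"), drop = FALSE])

# Open a single vignette by name (uncomment in an interactive session).
# vignette("grid", package = "grid")

# Or browse all of a package's vignettes in a web browser.
# browseVignettes("grid")
```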


Genomics data in Bioconductor

Bioconductor packages cover a wide range of biological data types.

This course focuses on high-throughput sequencing, so we will concentrate on the main packages for this kind of data.

This includes methods for handling common genomics data types.

Fasta and FastQ
BED, BED6 and narrowPeak/broadPeak
GFF
SAM and BAM

FASTA in Bioconductor

Genomic sequences stored as FASTA files are handled using the Biostrings package.
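A minimal sketch of reading FASTA with Biostrings follows. The two sequences and the temporary file are made up for illustration, and the example assumes the Biostrings package is installed.

```r
library(Biostrings)

# Write a tiny two-record FASTA file to a temporary location (made-up data).
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTACGTAC", ">seq2", "GGGCCCAT"), fasta_path)

# Read the file into a DNAStringSet, one element per FASTA record.
seqs <- readDNAStringSet(fasta_path)

names(seqs)   # record names taken from the ">" header lines
width(seqs)   # length of each sequence

# Individual sequences are DNAString objects with sequence-aware methods.
reverseComplement(seqs[["seq1"]])
```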


BED/BED6 in Bioconductor

Genomic intervals stored as BED files are handled using the rtracklayer and GenomicRanges packages.
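As an illustrative sketch, a small BED6 file can be imported with rtracklayer to give a GRanges object. The intervals below are invented, and the example assumes rtracklayer and GenomicRanges are installed.

```r
library(rtracklayer)
library(GenomicRanges)

# Write a two-line BED6 file (chrom, start, end, name, score, strand).
bed_path <- tempfile(fileext = ".bed")
writeLines(c("chr1\t100\t200\tpeakA\t10\t+",
             "chr1\t500\t650\tpeakB\t25\t-"), bed_path)

# import() returns a GRanges object; name and score become metadata columns.
peaks <- import(bed_path, format = "BED")
peaks

# BED starts are 0-based, GRanges starts are 1-based, so import() adds 1
# to each start while the interval widths are unchanged.
start(peaks)
width(peaks)
```

Keeping this coordinate shift in mind avoids off-by-one errors when comparing GRanges coordinates with the original BED file.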


Wigs and BigWigs in Bioconductor

Genomic scores stored as wig or bigWig files are handled using the rtracklayer and GenomicRanges packages.
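The sketch below writes a few made-up scores to a bigWig file and reads part of it back, which is one way to see how the two packages fit together. It assumes rtracklayer and GenomicRanges are installed; note that bigWig support in rtracklayer has historically not been available on Windows.

```r
library(rtracklayer)
library(GenomicRanges)

# Three non-overlapping 100 bp windows with invented scores.
scores <- GRanges("chr1",
                  IRanges(start = c(1, 101, 201), width = 100),
                  score = c(0.5, 2.0, 1.25))
# bigWig export needs to know the chromosome length (invented here).
seqlengths(scores) <- c(chr1 = 1000)

bw_path <- tempfile(fileext = ".bw")
export.bw(scores, bw_path)

# Read the whole file back as a GRanges with a score column.
all_scores <- import.bw(bw_path)

# Read only the part of the file overlapping a region of interest.
region <- GRanges("chr1", IRanges(50, 250))
import.bw(bw_path, which = region)
```

Restricting the import with which is useful for real bigWig files, which are often too large to load in full.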


GFF

GFF files containing gene models are handled using the GenomicFeatures package.
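A hedged sketch of building a gene model database from a GFF3 file is shown here. The single-gene annotation is invented, and the example assumes a Bioconductor release such as 3.13 in which makeTxDbFromGFF() is provided by GenomicFeatures; in more recent releases this function has moved to the txdbmaker package.

```r
library(GenomicFeatures)

# A made-up GFF3 file describing one gene with one two-exon transcript.
gff_path <- tempfile(fileext = ".gff3")
writeLines(c(
  "##gff-version 3",
  "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=GeneA",
  "chr1\texample\tmRNA\t1000\t2000\t.\t+\t.\tID=tx1;Parent=gene1",
  "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=tx1",
  "chr1\texample\texon\t1500\t2000\t.\t+\t.\tID=exon2;Parent=tx1"
), gff_path)

# Build a TxDb (transcript database) object from the file.
# Warnings about missing chromosome or CDS information are expected here.
txdb <- makeTxDbFromGFF(gff_path, format = "gff3")

# Extract the gene models as GRanges objects.
transcripts(txdb)
exons(txdb)
```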


FastQ

FastQ files containing sequence reads and their base quality scores are handled using the ShortRead package.
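As a small illustration, the following reads a two-read FastQ file with ShortRead and pulls out its parts. The reads and quality strings are invented, and the example assumes the ShortRead package is installed.

```r
library(ShortRead)

# Write two made-up reads in FastQ format (ID, sequence, "+", qualities).
fq_path <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGTACGT", "+", "IIIIIIII",
             "@read2", "GGCCTTAA", "+", "IIII####"), fq_path)

# readFastq() returns a ShortReadQ object holding all reads in memory.
reads <- readFastq(fq_path)

sread(reads)     # the read sequences
quality(reads)   # the encoded base qualities
id(reads)        # the read identifiers
```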


SAM/BAM

SAM and BAM files are handled using the GenomicAlignments package.
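A minimal sketch of loading alignments is given below. It uses the small example BAM file distributed with the Rsamtools package (a dependency of GenomicAlignments) rather than real experimental data.

```r
library(GenomicAlignments)

# Path to the example BAM file shipped with Rsamtools.
bam_path <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Read every alignment into a GAlignments object.
aln <- readGAlignments(bam_path)
aln

# Each alignment keeps its CIGAR string and reference sequence name.
head(cigar(aln))
table(seqnames(aln))
```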


Reference Data in Bioconductor

As well as software packages, Bioconductor maintains a number of annotation packages.

These include microarray annotation, gene to ID mappings, functional annotation of genes, genome sequence information and gene/transcript models.

Gene annotation

Gene annotation for model organisms is contained within the org.db packages.

Format is org.[species].[ID type].db

Homo sapiens annotation with Entrez Gene IDs – org.Hs.eg.db
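As a sketch of how such a package is queried, the following maps two Entrez Gene IDs to gene symbols. It assumes org.Hs.eg.db and AnnotationDbi are installed; the two IDs are chosen only as familiar examples.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# The kinds of annotation that can be retrieved from the package.
columns(org.Hs.eg.db)

# Map Entrez Gene IDs to gene symbols (7157 is TP53, 672 is BRCA1).
symbols <- mapIds(org.Hs.eg.db,
                  keys = c("7157", "672"),
                  column = "SYMBOL",
                  keytype = "ENTREZID")
symbols
```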

Genome Sequence

Genomic sequence information is held within the BSgenome packages.

Format is BSgenome.[species].[source].[major version]

Homo sapiens genome sequence from UCSC’s version hg19 – BSgenome.Hsapiens.UCSC.hg19
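A short sketch of retrieving sequence from a BSgenome package follows. It assumes BSgenome.Hsapiens.UCSC.hg19 is installed (it is a large download), and the 50 bp region is an arbitrary choice for illustration.

```r
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)

# The package exports a single genome object named after itself.
genome <- BSgenome.Hsapiens.UCSC.hg19

# Chromosome lengths are stored with the genome.
seqlengths(genome)["chr1"]

# Extract the sequence for an arbitrary 50 bp region on chr1.
region <- GRanges("chr1", IRanges(start = 1000001, width = 50))
region_seq <- getSeq(genome, region)
region_seq
```

Sequence is loaded from disk only for the regions requested, so the whole genome never needs to be held in memory.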


Gene Models

Gene models are held in the TxDb packages.

Format is TxDb.[species].[source].[major version].[table]

Homo sapiens gene build from UCSC’s version hg19 known gene table – TxDb.Hsapiens.UCSC.hg19.knownGene
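The sketch below shows one way gene models might be pulled from a TxDb package. It assumes TxDb.Hsapiens.UCSC.hg19.knownGene is installed; gene identifiers in this package are Entrez Gene IDs, and 7157 (TP53) is used only as an example.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# The package exports a single TxDb object named after itself.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# One genomic range per gene, named by Entrez Gene ID.
gene_ranges <- genes(txdb)
gene_ranges

# Transcripts grouped by the gene they belong to.
tx_by_gene <- transcriptsBy(txdb, by = "gene")
tx_by_gene[["7157"]]
```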


Time for an exercise.

Link_to_exercises

Link_to_answers

Introduction to Bioconductor

Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/Bioconductor_Introduction/
