Bioinformatics Resource Center

Thomas Carroll

February 2021

Overview

Who we are.
What our role is.
What we have been doing.
- Pipelines and Software
- Training
- Analysis Support

Who we are

BRC formed in July 2017.
Centralized shared resource for bioinformatics support.
Currently 5 members.
Located ~~in IT pavilion~~ on Zoom.

Who we are

(experience)

NGS.
- RNA-seq
  - Bulk RNA
  - Single Cell
- ATAC-seq
- ChIP-seq
- WGS
- CRISPR-seq
Proteomics
Image

Who we are

(experience)

Bioinformatics software development in R/Python/Javascript/C++

Who we are

(experience)

Creation and hosting of training courses for bioinformatics data analysis

Role at Rockefeller

Support bioinformatics analysis from project design through to publication.

Identify, evaluate and disseminate emerging bioinformatics techniques.

Role at Rockefeller

What types of analysis?

Preprocessing and processing of data.
Standard downstream analysis.
Custom analysis.

What do we need?

(Diagram: Training, Analysis, Pipelines & Software)

Standardised pipelines and software for processing and initial analysis.
Training and protocols for bioinformatics methods.
Customized bioinformatics analysis.

Need a system to maintain
- Analysis code.
- Software/reference data versions used.
- Analysis notes and communications.
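
One way to keep software and reference data versions next to the analysis code is to write a small manifest at the start of each run. The sketch below is illustrative only and is not the system used by the BRC: the function name `build_manifest`, the manifest keys and the `GRCh38` label are all assumptions made for the example.

```python
import json
import platform
from importlib import metadata


def build_manifest(packages, reference_data=None):
    """Collect interpreter, package and reference-data versions in one dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping it.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        # Hypothetical labels supplied by the analyst, e.g. {"genome": "GRCh38"}.
        "reference_data": dict(reference_data or {}),
    }


if __name__ == "__main__":
    manifest = build_manifest(["pip"], {"genome": "GRCh38"})
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Committing the resulting JSON alongside the analysis code gives a dated record of what was installed when the results were produced.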

Long term reproducibility

Requirements for reproducible research

Journals such as Nature and Cell lead the call for more reproducibility in bioinformatics.
Data submission already a requirement of most peer-reviewed journals (GEO/SRA/ENA).
Code submission now encouraged in many journals and a requirement of a few.

Workflows for high throughput biology

Central effort to automate and optimize processing of common data types.
- Accelerate initial processing and standard analysis of data.
- Produce a version controlled, reproducible set of results and data.
- Reduce copying/coding errors.
With workflows we should
- Gain better understanding of QC and data characteristics over time.
- Compare new methods with old results easily.
- Pool resources for method and software development.

Irreproducibility in Bioinformatics

Irreproducible results come from multiple sources.
- Copying/coding errors.
- Lack of documentation on versions/software used.

Identifying sources of irreproducibility

Building a workflow for

NGS data analysis

Automate common NGS data processing and initial analysis.
- Retrieve and process data.
- Perform quality control of data.
- Standard analysis of data.
Produce deliverable report summarizing useful data metrics and analysis results.
Scale from small to large datasets.
Reproducible analysis and version control of both software and genomic data.

Analysis of Next Generation Sequencing

NGS is a commonly used high-throughput technique.
NGS used to study
- Transcriptome/Translatome - RNA-seq and Ribo-seq
- Epigenome - ChIP-seq/exo and ATAC-seq
- Genetic mutation/variation - WGS and Exome-seq.
- Many others.
RNA-seq, ChIP-seq and ATAC-seq are becoming standard techniques in molecular biology.

The NGSpipeR pipeline

R package to analyze RNA-seq/ChIP-seq/ATAC-seq data.
Required genome annotation from Bioconductor packages or user defined.
Version control through packrat.
Scalable from laptop to parallelization on HPC cluster.
Testing through R CMD check/testthat using Travis/AppVeyor CI systems.
Dynamic document reporting.

NGSpipeR analysis workflow

Samplesheet template - NGSpipeR

NGSpipeR outputs-

Processed files.
- Raw and aligned sequence data (fastQ/BAM).
- Normalized signal graph (bigWigs).
- Counts in genes/transcripts/exons.
- Transcripts per million (TPM)
- Peaks
Analysis results
- DE Genes
- DU Transcripts
- GO enrichment
- Differential peaks.
- Motif enrichment
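
Transcripts per million (TPM) in the list above follows a standard definition: divide each count by the feature length, then rescale the resulting rates so they sum to one million. The sketch below shows that arithmetic only. The function name `tpm` and the example numbers are made up for illustration, and this is not NGSpipeR code.

```python
def tpm(counts, lengths_bp):
    """Convert raw counts to transcripts per million.

    counts: reads (or fragments) assigned to each transcript.
    lengths_bp: (effective) length of each transcript in base pairs.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    if any(length <= 0 for length in lengths_bp):
        raise ValueError("transcript lengths must be positive")
    # Step 1: length-normalise each count to a per-base rate.
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Step 2: rescale so that the values in one sample sum to one million.
    return [rate / total * 1e6 for rate in rates]


if __name__ == "__main__":
    # Three made-up transcripts with identical per-base coverage.
    print(tpm([10, 20, 30], [1000, 2000, 3000]))
```

Because the values in each sample sum to the same total, TPM is easier to compare between samples than raw counts, although it is not a substitute for the count-based models used in differential expression testing.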

Result files

Analysis files
- DESeq2 objects
- rlog normalized data
- Non-redundant peaks and occurrence in samples/groups
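
A non-redundant peak set can be built by merging overlapping peaks across samples while remembering which samples contributed to each merged region. The slides do not describe how NGSpipeR does this, so the following is a generic interval-merge sketch. It assumes half-open `(chrom, start, end)` coordinates, and the function name and sample labels are hypothetical.

```python
def non_redundant_peaks(peaks_by_sample):
    """Merge overlapping peaks across samples.

    peaks_by_sample: {sample_name: [(chrom, start, end), ...]} with
    half-open intervals. Returns a list of dicts, one per merged region,
    each carrying the set of samples in which the region occurs.
    """
    # Flatten to (chrom, start, end, sample) and sort by position.
    tagged = sorted(
        (chrom, start, end, sample)
        for sample, peaks in peaks_by_sample.items()
        for chrom, start, end in peaks
    )
    merged = []
    for chrom, start, end, sample in tagged:
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start < last["end"]:
            # Overlaps the current region: extend it and note the sample.
            last["end"] = max(last["end"], end)
            last["samples"].add(sample)
        else:
            merged.append(
                {"chrom": chrom, "start": start, "end": end, "samples": {sample}}
            )
    return merged


if __name__ == "__main__":
    example = {
        "A": [("chr1", 100, 200), ("chr1", 500, 600)],
        "B": [("chr1", 150, 250), ("chr2", 100, 200)],
    }
    for region in non_redundant_peaks(example):
        print(region["chrom"], region["start"], region["end"], sorted(region["samples"]))
```

The per-region sample sets can then be summarised by group, for example to keep only regions seen in at least two replicates.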

NGSpipeR outputs-

Reports

Dynamic document.
- Describes analysis performed.
- Input and output files details.
- QC tables and plots.
- Analysis summaries and visualization.
- Versions used in analysis.

PDF output example

Word/OpenOffice example

R Markdown is readily converted to HTML.
Allows construction of a single page or whole websites.
Can include interactive elements through JavaScript.
Interaction with external tools and websites through HTML and ports.

NGSpipeR outputs-

Interactive Reports

NGSpipeR outputs-

Genome Browsing

Common step in NGS analysis is to review data in a genome browser.
- IGV (above) offers a desktop system to review.
- Sample Metadata can be included in a required format.
- High degree of configuration possible through XML sessions.
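
As an illustration of configuring IGV through XML sessions, the sketch below writes a minimal session file listing the tracks to load. The element and attribute names (`Session`, `Resources`, `Resource`, `genome`, `path`, `version`) follow the IGV session layout as commonly documented, but they are an assumption here and should be checked against the IGV documentation for the version in use. The file paths and the `mm10` genome are placeholders.

```python
import xml.etree.ElementTree as ET


def igv_session(genome, track_paths):
    """Return a minimal IGV session document as an XML string."""
    # Root element: genome identifier plus a session format version (assumed "8").
    session = ET.Element("Session", {"genome": genome, "version": "8"})
    resources = ET.SubElement(session, "Resources")
    for path in track_paths:
        # One Resource per file (bigWig, BAM, BED, ...) that IGV should load.
        ET.SubElement(resources, "Resource", {"path": path})
    return ET.tostring(session, encoding="unicode")


if __name__ == "__main__":
    # Placeholder paths; in a pipeline these would come from the sample sheet.
    xml_text = igv_session("mm10", ["sampleA.bw", "sampleB.bw"])
    with open("session.xml", "w", encoding="utf-8") as handle:
        handle.write(xml_text)
```

Generating the session from the same sample sheet that drives the pipeline keeps the browser view consistent with the processed files.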

NGSpipeR -

RNA-seq- Splicing test set

Ptbp1 knock down (Cancer Cell 2018)

HTML output

NGSpipeR Homepage

NGSpipeR

Updates can introduce errors.
Software used can change defaults or introduce their own errors.
Testing with CI systems can help identify errors before data processing.

Keeping the workflow running

Workflows can deliver primary results quickly and reproducibly.
Dynamic documents allow for interactivity in exploration of results.
Post-workflow QC, filtering, re-grouping and re-analysis may be required.
Intermediate files and saved workspaces allow for the regeneration of results under different parameters.

Working downstream of the workflow

A hybrid approach -
- Workflows.
  - Handle CPU/memory intensive steps.
  - Perform highly parallelised steps where no human intervention is required.
- A graphical user interface.
  - Review quality metrics, apply filtering/regrouping.
  - Perform exploratory analysis on post processed data.

Building an interactive toolset for

NGS data analysis

Shiny_NGSpipeR

Differential expression and

functional enrichment

Shiny_NGSpipeR -QC

Shiny_NGSpipeR
-Saving outputs and reports

Custom Analysis

A large component of our work is unique analysis, or combinations of analyses, tailored to the hypothesis under investigation.

Downstream analysis of data from workflows.
Custom analysis required to test a particular hypothesis.

Custom Analysis

Broad range of analysis types
- Multi-factor RNA-seq, ATAC-seq, ChIP-seq experiments.
- Single cell RNA-seq.
- Time course analysis of image data.

(Yang R et al, 2020; Xi L et al, 2020; Jove V et al 2020)

Custom Analysis

Use R Markdown to present analysis written in R or other languages.
- Capture versions.
- Record code.
- Disseminate methods to user.

Analysis with Dynamic Documents

Analysis and Code collaboration

Work closely with user in analysis.
- Walk-through analysis concepts in one to one meetings
- Work collaboratively using GitHub.
- Provide support in custom analysis.
- Provide documentation for publication.

Analysis and Code collaboration

Internally use Redmine to
- Maintain analysis code.
- Record software/reference data versions used.
- Keep analysis notes and communications.
- Configure and run NGS pipelines.

Custom Analysis

From Unique to Routine

All methods start as custom analysis.
- Update pipelines to automate the custom when it becomes the common.
- Package code as software when it is of general use to the community.
- Training in popular downstream methods.

Software development

Developing software for ultra high throughput QC of fastQ.
Developed code to make use of and share external software in R/Bioconductor.
User requirement to pre-process CLIP-seq data within a self contained R environment.

Publicly available packaged software

Developed with everyone in BRC.
- Rfastp - Wei Wang
- Herper - Matt Paul
- CLIPflexR - Kathryn Rozen-Gagnon (Rice lab) and Ji-Dung Luo

Bioinformatics Training

Different levels of training required.
- Understanding of results or data types.
- Downstream analysis and visualisation techniques.
- Basic programming and in depth analysis techniques.


Structured training program

Progressive training built from scratch.
- Courses linked by both data and techniques.
Data reflecting a real world analysis, not toy examples.

Intro to R

Genomic

Files

RNA-seq

Analysis

Alignments

Bioinformatics Training Courses

Since publications must be reproducible
- Data is available in public repositories
- Methods applied to data are well documented
Structure bioinformatics courses around published data.
- Retrieve publicly available data.
- Re-create results within publications.
- Expand on results found in publication.
Learn to evaluate published bioinformatics results.

Training in context

Course Program

Training material

Automation and testing

Maintaining course material is time consuming
- Software version changes.
- Recompiling of slides, handouts, manual etc.
- Commits to material introduce errors.
Use automation to test and compile courses
- Test all material on 3 major operating systems
  - Windows/MacOS/Linux(Ubuntu).
- Generate multiple formats of training material
  - Slides/Single Web page/Code
- Create training website (e.g. installation instructions, software requirements) and build training package for download.
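
Producing a code-only version of course material from a central R Markdown file amounts to pulling out the R chunks, which is what `knitr::purl` does in R. The sketch below shows the same idea in a few lines and is not the BRC build system. The function names are hypothetical and the pattern only handles plain `{r ...}` chunks delimited by three backticks.

```python
import re

# Built programmatically so the fence characters do not clash with this document.
FENCE = "`" * 3

# An opening fence such as {r} or {r setup, echo=FALSE}, the chunk body,
# then a closing fence on its own line.
CHUNK = re.compile(
    r"^" + FENCE + r"\{r(?:[ ,][^}]*)?\}[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_r_chunks(rmd_text):
    """Return the body of every R chunk, in document order."""
    return [match.group(1) for match in CHUNK.finditer(rmd_text)]


def chunks_to_script(rmd_text):
    """Join all R chunks into a single script, one chunk after another."""
    chunks = extract_r_chunks(rmd_text)
    return "\n".join(chunk.rstrip("\n") for chunk in chunks) + "\n"


if __name__ == "__main__":
    rmd = (
        "# Title\n\n"
        + FENCE + "{r setup, echo=FALSE}\nx <- 1\n" + FENCE + "\n\n"
        + "Some text.\n\n"
        + FENCE + "{r}\ny <- x + 1\nprint(y)\n" + FENCE + "\n"
    )
    print(chunks_to_script(rmd))
```

Running the extracted script on each operating system in CI is one way to catch material that no longer executes after a software update.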

In class and online

Teach in a classroom
Make material available online for reference.
Create training material from central rMarkdown file.
- Presentation slides.
- PDF handouts.
- Interactive training.

A Rockefeller Bioinformatics Reference Manual

Courses run

Regularly updated to new methods.
- Version 3 added
  - CRISPR-seq analysis.
  - Motif analysis.
  - Epigenetic heat maps.
Graduate School - June to September 2020
Open course - May to August 2020
- At peak ~ 140 across two sessions.
- ~100hrs of video!!

Courses upcoming

Graduate school - Summer 2021
Open course (compressed)
- Sign up at - http://bit.ly/brc_training
  - March 1st to 3rd - Introduction to R
  - March 8th to 10th - Introduction to Bioconductor
  - March 15th to 19th - Applied Bioinformatics
  - March 22nd to 23rd - Genome assembly
    - Upcoming course on genome assembly from Reference Genome Center!

New upcoming courses
- Single Cell analysis primer
- Introduction to GitHub
- Introduction to Rcpp (R and C++)

Collaborators welcome!
- Templates available for course creation.

Courses run and upcoming

Regularly updated to new methods.
- Version 2 added
  - Excel import/export
  - fGSEA
  - Deeptools and profileplyr
Graduate School.
- 13 weeks,
  - 2 hrs lectures
  - 4 hrs exercises
- Summer 2018
- Fall 2018
April/May bioinformatics course
- Compressed course.

Training - Local community

Co-organise New York City R and Bioconductor meet-ups.
Host more advanced and specialized training.
Workshops on Thursday/Friday evenings covering topics from protein-protein interactions analysis to metagenomics.

Training - Global community

BRC co-hosting Bioconductor conference 2019
- Main conference at Rockefeller University.
- Developer day at NYU Langone.
Talks on latest techniques in bioinformatics and hands on workshops in analysis of data in R and Bioconductor.
50% discount for Rockefeller employees and students.

Resources and more information

Rockefeller internal information
- http://inside.rockefeller.edu/bioinformatics
Public site
- https://rockefelleruniversity.github.io/
- https://www.rockefeller.edu/bioinformatics/
New York City R and Bioconductor Meet-ups
- https://www.meetup.com/BiocNYC/

Talk's resources

For more information

On our graphical user interface
On our software
On our training
For analysis support

Please email us at

brc@rockefeller.edu

Link for March training sign-up

http://bit.ly/brc_training

All these slides are on our website

Thanks to

Office of the CIO

Anthony Carvalloza

Genomics Resource Centre

Connie Zhao

Sophie Huang

Bioimaging Resource Center

Alison North

Reference Genome Resource Center

Olivier Fedrigo

Giulio Formenti

You!
