Bioinformatics Resource Center

Thomas Carroll

February 2021

Overview

  • Who we are.
  • What our role is.
  • What we have been doing:
    • Pipelines and Software
    • Training
    • Analysis Support

Who we are

  • BRC formed in July 2017.
  • Centralized shared resource for bioinformatics support.
  • Currently 5 members.
  • Located in the IT pavilion and on Zoom.

Who we are

(experience)

  • NGS.
    • RNA-seq
      • Bulk RNA
      • Single Cell
    • ATAC-seq
    • ChIP-seq
    • WGS
    • CRISPR-seq
  • Proteomics
  • Image

  • Bioinformatics software development in R/Python/Javascript/C++

  • Creation and hosting of training courses for bioinformatics data analysis

Role at Rockefeller

  • Support bioinformatics analysis from project design through to publication.

 

  • Identify, evaluate and disseminate emerging bioinformatics techniques.

Role at Rockefeller

What types of analysis?

  • Preprocessing and processing of data.
  • Standard downstream analysis.
  • Custom analysis.

What do we need?

(Diagram: Training; Analysis; Pipelines & Software)

  • Standardised pipelines and software for processing and initial analysis.
  • Training and protocols for bioinformatics methods.
  • Customized bioinformatics analysis.
  • Need a system to maintain:
    • Analysis code.
    • Software/reference data versions used.
    • Analysis notes and communications.
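
One small piece of such a system is recording the software versions used alongside each analysis. The following is a minimal sketch, assuming a Python environment; the function name `capture_versions` and the package names are illustrative and not part of any BRC tool.

```python
import json
import platform
from importlib import metadata


def capture_versions(packages):
    """Return interpreter, platform and package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly rather than failing the analysis.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Illustrative package names; list whatever the analysis actually imports.
record = capture_versions(["numpy", "pandas"])
print(json.dumps(record, indent=2, sort_keys=True))
```

Saving this record next to the analysis code and notes means results can later be traced to the environment that produced them.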

Long term reproducibility

Requirements for reproducible research

  • Journals such as Nature and Cell lead the call for more reproducibility in bioinformatics.
  • Data submission is already a requirement of most peer-reviewed journals (GEO/SRA/ENA).
  • Code submission is now encouraged by many journals and required by a few.

Workflows for high throughput biology

  • Central effort to automate and optimize processing of common data types.
    • Accelerate initial processing and standard analysis of data.
    • Produce a version controlled, reproducible set of results and data.
    • Reduce copying/coding errors.
  • With a workflow we should -
    • Gain better understanding of QC and data characteristics over time.
    • Easily compare new methods with old results.
    • Pool resources for method and software development.

Irreproducibility in Bioinformatics

  • Irreproducible results come from multiple sources.
    • Copying/coding errors.
    • Lack of documentation on versions/software used.

Identifying sources of irreproducibility

Building a workflow for NGS data analysis

  • Automate common NGS data processing and initial analysis.
    • Retrieve and process data.
    • Perform quality control of data.
    • Standard analysis of data.
  • Produce deliverable report summarizing useful data metrics and analysis results.
  • Scale from small to large datasets.
  • Reproducible analysis and version control of both software and genomic data.
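
The retrieve, quality control and analysis stages above can be sketched as an ordered list of steps that pass a shared state along. This is a toy illustration only: the step functions, the `min_quality` parameter and the in-memory "reads" are hypothetical stand-ins, not the actual pipeline.

```python
def retrieve(state):
    # Stand-in for fetching sequence data: (read_id, mean_quality) pairs.
    state["reads"] = [("r1", 35.0), ("r2", 12.0), ("r3", 31.5)]
    return state


def quality_control(state):
    # Keep reads whose mean quality meets the configured cutoff.
    cutoff = state["min_quality"]
    state["passed"] = [read for read in state["reads"] if read[1] >= cutoff]
    return state


def summarise(state):
    # Metrics of the kind a deliverable report would summarise.
    state["report"] = {
        "total": len(state["reads"]),
        "passed": len(state["passed"]),
    }
    return state


def run_workflow(steps, params):
    """Run each step in order and record which steps completed."""
    state = dict(params)
    completed = []
    for step in steps:
        state = step(state)
        completed.append(step.__name__)
    state["steps_run"] = completed
    return state


result = run_workflow([retrieve, quality_control, summarise], {"min_quality": 20.0})
print(result["report"])
```

Keeping the parameters and the list of completed steps in the returned state makes it straightforward to write both into the final report.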

Analysis of Next Generation Sequencing

  • NGS is a commonly used high-throughput technique.
  • NGS is used to study:
    • Transcriptome/Translatome - RNA-seq and Ribo-seq
    • Epigenome - ChIP-seq/exo and ATAC-seq
    • Genetic mutation/variation - WGS and Exome-seq.
    • Many others.
  • RNA-seq, ChIP-seq and ATAC-seq are becoming standard techniques in molecular biology.

The NGSpipeR pipeline

  • R package to analyze RNA-seq/ChIP-seq/ATAC-seq data.
  • Required genome annotation comes from Bioconductor packages or is user defined.
  • Version control through packrat.
  • Scalable from laptop to parallelization on HPC cluster.
  • Testing through R CMD check/testthat using Travis/AppVeyor CI systems.
  • Dynamic document reporting.

NGSpipeR analysis workflow

Samplesheet template - NGSpipeR
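
A sample sheet is usually validated before any processing starts. The sketch below shows the idea, assuming a CSV layout; the column names `sample`, `group` and `fastq` are hypothetical and are not the actual NGSpipeR template.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
REQUIRED_COLUMNS = ("sample", "group", "fastq")


def read_samplesheet(handle):
    """Parse a CSV sample sheet and check columns and sample-name uniqueness."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in present]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = list(reader)
    names = [row["sample"] for row in rows]
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate sample names: {', '.join(duplicates)}")
    return rows


example = io.StringIO(
    "sample,group,fastq\n"
    "ctrl_1,control,ctrl_1.fastq.gz\n"
    "kd_1,knockdown,kd_1.fastq.gz\n"
)
rows = read_samplesheet(example)
print([row["group"] for row in rows])
```

Failing early on a malformed sheet avoids discovering a mislabelled sample only after hours of alignment.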

NGSpipeR outputs-

  • Processed files.
    • Raw and aligned sequence data (fastQ/BAM).
    • Normalized signal graph (bigWigs).
    • Counts in genes/transcripts/exons.
    • Transcripts per million (TPM)
    • Peaks
  • Analysis results
    • DE Genes
    • DU Transcripts
    • GO enrichment
    • Differential peaks
    • Motif enrichment
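
As a worked example of one of the outputs listed above, transcripts per million (TPM) can be computed from raw counts and feature lengths. This is a minimal sketch of the standard definition, assuming plain feature lengths in base pairs rather than effective lengths; it is not the pipeline's own code.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert read counts to TPM: length-normalise, then scale to one million."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per base for each feature.
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Each feature's share of the total rate, scaled so values sum to 1e6.
    return [rate / total * 1e6 for rate in rates]


tpm = counts_to_tpm([100, 200, 300], [1000, 2000, 1000])
print([round(value) for value in tpm])
```

The first two features have the same TPM despite different counts, because the second is twice as long; this length correction is what separates TPM from raw counts.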

Result files

  • Analysis files
    • DESeq2 objects
    • rlog normalized data
    • Non-redundant peaks and occurrence in samples/groups
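
A non-redundant peak set can be built by merging overlapping peaks across samples while tracking which samples contributed to each merged peak. The sketch below illustrates that idea with plain tuples and half-open coordinates; it is an assumption about the general approach, not the package's implementation.

```python
def merge_peaks(peaks_by_sample):
    """Merge overlapping peaks across samples.

    peaks_by_sample maps sample name -> list of (chrom, start, end),
    using half-open coordinates. Returns merged peaks with the set of
    samples in which each merged peak occurs.
    """
    tagged = sorted(
        (chrom, start, end, sample)
        for sample, peaks in peaks_by_sample.items()
        for chrom, start, end in peaks
    )
    merged = []
    for chrom, start, end, sample in tagged:
        last = merged[-1] if merged else None
        if last and last["chrom"] == chrom and start < last["end"]:
            # Overlaps the previous peak: extend it and note the sample.
            last["end"] = max(last["end"], end)
            last["samples"].add(sample)
        else:
            merged.append(
                {"chrom": chrom, "start": start, "end": end, "samples": {sample}}
            )
    return merged


peaks = {
    "A": [("chr1", 100, 200), ("chr1", 500, 600)],
    "B": [("chr1", 150, 250), ("chr2", 100, 200)],
}
for peak in merge_peaks(peaks):
    print(peak["chrom"], peak["start"], peak["end"], sorted(peak["samples"]))
```

The per-peak sample sets give the occurrence table directly: a peak's occurrence in a group is the number of that group's samples in its set.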

NGSpipeR outputs-

Reports

  • Dynamic document.
    • Describes analysis performed.
    • Input and output file details.
    • QC tables and plots.
    • Analysis summaries and visualization.
    • Versions used in analysis.

PDF output example

Word/Openoffice example

  • R Markdown is readily converted to HTML.
  • Allows construction of a single page or a whole website.
  • Can include interactive elements through Javascript.
  • Interaction with external tools and websites through HTML and ports.

NGSpipeR outputs-

Interactive Reports

NGSpipeR outputs-

Genome Browsing

  • Common step in NGS analysis is to review data in a genome browser.
    • IGV (above) offers a desktop system to review.
    • Sample Metadata can be included in a required format.
    • High degree of configuration possible through XML sessions.

NGSpipeR - RNA-seq splicing test set

Ptbp1 knock down (Cancer Cell 2018)

HTML output

NGSpipeR Homepage

NGSpipeR

  • Updates can introduce errors.
  • Software dependencies can change defaults or introduce errors of their own.
  • Testing with CI systems can help identify errors before data processing.
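
One kind of check a CI job might run is a regression test that fingerprints the output of a step on a small fixed input and compares it to a stored baseline. The sketch below is illustrative: `count_bases` is a toy stand-in for a pipeline step, and `result_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def result_fingerprint(result):
    """Return a stable SHA-256 digest of a JSON-serialisable result."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def count_bases(sequence):
    # Toy stand-in for a pipeline step under test.
    return {base: sequence.count(base) for base in "ACGT"}


# Baseline recorded once, with a known-good version of the software.
baseline = result_fingerprint(count_bases("ACGTACGTTT"))

# Re-run after an update; a changed default or new bug alters the digest.
current = result_fingerprint(count_bases("ACGTACGTTT"))
if current != baseline:
    raise SystemExit("regression: output changed since the baseline")
print("output matches baseline")
```

Exact digests suit deterministic steps; numeric outputs that vary slightly between platforms would need a tolerance-based comparison instead.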

Keeping the workflow running

  • Workflows can deliver primary results quickly and reproducibly.
  • Dynamic documents allow for interactivity in exploration of results.
  • Post-workflow QC, filtering, re-grouping and re-analysis may be required.
  • Intermediate files and saved workspaces allow for the regeneration of results under different parameters.

Working downstream of the workflow

  • A hybrid approach - 
    • Workflows.
      • Handle CPU/memory intensive steps.
      • Perform highly parallelised steps that require no human intervention.
    • A graphical user interface.
      • Review quality metrics, apply filtering/regrouping.
      • Perform exploratory analysis on post processed data.

Building an interactive toolset for NGS data analysis

Shiny_NGSpipeR

Shiny_NGSpipeR - Differential expression and functional enrichment

Shiny_NGSpipeR - QC

Shiny_NGSpipeR - Saving outputs and reports

Custom Analysis

A large component of our work is unique analyses, or combinations of analyses, tailored to the hypothesis under investigation.

  • Downstream analysis of data from workflows.
  • Custom analysis required to test a particular hypothesis.

Custom Analysis

  • Broad range of analysis types

    • Multi-factor RNA-seq, ATAC-seq, ChIP-seq experiments.
    • Single cell RNA-seq.
    • Time course analysis of image data.

(Yang R et al., 2020; Xi L et al., 2020; Jove V et al., 2020)

Custom Analysis

  • Use R Markdown to present analysis in R or other languages.
    • Capture versions.
    • Record code.
    • Disseminate methods to user.

Analysis with Dynamic Documents

Analysis and Code collaboration

  • Work closely with users on analysis.
    • Walk through analysis concepts in one-to-one meetings.
    • Work collaboratively using GitHub.
    • Provide support in custom analysis.
    • Provide documentation for publication.

Analysis and Code collaboration

  • Internally use Redmine to:
    • Maintain analysis code.
    • Track software/reference data versions used.
    • Record analysis notes and communications.
    • Configure and run NGS pipelines.

Custom Analysis

From Unique to Routine

  • All methods start as custom analysis.
    • Update pipelines to automate the custom when it becomes the common.
    • Package code as software when it is of general use to the community.
    • Training in popular downstream methods.

Software development

  • Developing software for ultra high throughput QC of fastQ.
  • Developed code to make use of and share external software in R/Bioconductor.
  • User requirement to pre-process CLIP-seq data within a self-contained R environment.

Publicly available packaged software

  • Developed with everyone in BRC.
    • Rfastp - Wei Wang
    • Herper - Matt Paul
    • CLIPflexR - Kathryn Rozen-Gagnon (Rice lab) and Ji-Dung Luo

Bioinformatics Training

  • Different levels of training required.
    • Understanding of results or data types. 
    • Downstream analysis and visualisation techniques.
    • Basic programming and in depth analysis techniques.

Structured training program

  • Progressive training built from scratch.
    • Courses linked by both data and techniques.
  • Data reflecting a real world analysis, not toy examples.

(Diagram: Intro to R; Genomic Files; RNA-seq Analysis; Alignments)

Bioinformatics Training Courses

 

  • Since publications must be reproducible:
    • Data is available in public repositories.
    • Methods applied to data are well documented.
  • Structure bioinformatics courses around published data.
    • Retrieve publicly available data.
    • Re-create results within publications.
    • Expand on results found in publication.
  • Learn to evaluate published bioinformatics results.

Training in context

Course Program

Training material

Automation and testing

  • Maintaining course material is time-consuming:
    • Software version changes.
    • Recompiling of slides, handouts, manual etc.
    • Commits to material can introduce errors.
  • Use automation to test and compile courses:
    • Test all material on 3 major operating systems
      • Windows/MacOS/Linux(Ubuntu).
    • Generate multiple formats of training material
      • Slides/Single Web page/Code
    • Create a training website (e.g. installation instructions, software requirements) and build a training package for download.

In class and online

  • Teach in a classroom 
  • Make material available online for reference.
  • Create training material from a central R Markdown file.
    • Presentation slides.
    • PDF handouts.
    • Interactive training.

A Rockefeller Bioinformatics Reference Manual

Courses run

  • Regularly updated to new methods.
    • Version 3 added
      • CRISPR-seq analysis.
      • Motif analysis.
      • Epigenetic heat maps.
  • Graduate School - June to September 2020
  • Open course - May to August 2020
    • At peak, ~140 attendees across two sessions.
    • ~100 hrs of video!

Courses upcoming

  • Graduate school - Summer 2021
  • Open course (compressed)
    • Sign up at - http://bit.ly/brc_training
      • March 1st to 3rd - Introduction to R
      • March 8th to 10th - Introduction to Bioconductor
      • March 15th to 19th - Applied Bioinformatics
      • March 22nd to 23rd - Genome assembly
        • Upcoming course on genome assembly from Reference Genome Center!
  • New upcoming courses
    • Single Cell analysis primer
    • Introduction to GitHub
    • Introduction to Rcpp (R and C++)
  • Collaborators welcome!
    • Templates available for course creation.

Courses run and upcoming

  • Regularly updated to new methods.
    • Version 2 added
      • Excel import/export
      • fGSEA
      • Deeptools and profileplyr
  • Graduate School.
    • 13 weeks:
      • 2 hrs lectures
      • 4 hrs exercises
    • Summer 2018
    • Fall 2018
  • April/May bioinformatics course
    • Compressed course.

Training - Local community

  • Co-organise New York City R and Bioconductor meet-ups
  • Host more advanced and specialized training.
  • Workshops on Thursday/Friday evenings covering topics from protein-protein interactions analysis to metagenomics.

Training - Global community

  • BRC co-hosting Bioconductor conference 2019
    • Main conference at Rockefeller University.
    • Developer day at NYU Langone.
  • Talks on the latest techniques in bioinformatics and hands-on workshops on the analysis of data in R and Bioconductor.
  • 50% discount for Rockefeller employees and students.

Resources and more information

  • Rockefeller internal information
    • http://inside.rockefeller.edu/bioinformatics
  • Public site
    • https://rockefelleruniversity.github.io/
    • https://www.rockefeller.edu/bioinformatics/
  • New York City R and Bioconductor Meet-ups
    • https://www.meetup.com/BiocNYC/

Talk's resources

For more information

  • On our graphical user interface
  • On our software
  • On our training
  • For analysis support

Please email us at

brc@rockefeller.edu

  • Link for March training sign-up

http://bit.ly/brc_training

 

All these slides are on our website

 

Thanks to

Office of the CIO

Anthony Carvalloza

Genomics Resource Centre

Connie Zhao

Sophie  Huang

Bioimaging Resource Center

Alison North

Reference Genome Resource Center

Olivier Fedrigo

Giulio Formenti

You!
