Genomics Data

Overview

Reference genomes and GRC.
Fasta and FastQ (Unaligned sequences).
SAM/BAM (Aligned sequences).
BED (Genomic Intervals).
GFF/GTF (Gene annotation).
Wiggle files, BEDgraphs and BigWigs (Genomic scores).
VCF and MAF (Genomic variations).
HDF5 and Loom (Experimental measurements and metadata).

Reference Genomes

Are we there yet?

The human genome isn't complete!
In fact, most model organisms' reference genomes are being regularly updated.
A reference genome consists of a mixture of known chromosomes and unplaced contigs, together called a Genome Reference Assembly.
Major revisions to assemblies result in a change of coordinates.
This requires conversion between revisions.
The latest genome assembly for humans is GRCh38.
Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3.

Genome Reference Consortium

The GRC is a collaboration of institutes which curate and maintain the reference genomes for 3 model organisms.
- Human - GRCh38.p3
- Mouse - GRCm38.p3
- Zebrafish - GRCz10
Other model organisms are maintained separately.
- Drosophila - Berkeley Drosophila Genome Project, BDGP36
Non-model organisms are found across repositories, including UCSC GenArk.

Why do we need reference genomes?

Allows for genes and genomic features to be evaluated in their linear genomic context.
- Gene A is close to Gene B
- Gene A and Gene B are within feature C.
Can be used to align shallow targeted high-throughput sequencing to a pre-built map of an organism's genome.

Aligning to a reference genome

A reference genome

A reference genome is a collection of contigs.
A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
Typically comes in FASTA format.
- “>” line contains information on contig
- Lines following contain contig sequence
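The FASTA layout described above can be read with a few lines of code. This is a minimal sketch rather than a full parser: the function name `read_fasta` and the two-contig example are made up for illustration, and the contig name is assumed to be the first word after ">".

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect contig sequences keyed by the first word of each '>' line."""
    contigs: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: contig name is the text up to the first space.
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            contigs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in contigs.items()}


# Made-up example, not a real assembly.
example = """>chr1 hypothetical contig
ACGTNNAC
GGTA
>chrUn_1 unplaced contig
TTGACA
""".splitlines()

genome = read_fasta(example)
lengths = {name: len(seq) for name, seq in genome.items()}
```

Here `lengths` maps each contig name to its sequence length, which is the same information a SAM header later records for the reference.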

High-throughput Sequencing formats

Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence files.
- FASTQ - Unaligned sequences
- SAM - Aligned sequences

Unaligned Sequences

FastQ (FASTA with Qualities)

“@” followed by identifier.
Sequence information.
“+”
Quality scores encoded as ASCII.
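The four-line record structure can be sketched as a small generator. The name `read_fastq` and the two reads below are hypothetical; the sketch assumes each record occupies exactly four lines, which is the common case but not guaranteed for every FASTQ file.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality string) for each four-line record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue
        # Lines 2-4 of the record; empty strings if the file is truncated.
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


# Made-up reads for illustration.
fastq_lines = ["@read1/1", "ACGTN", "+", "IIII#", "@read2/1", "GGCA", "+", "5555"]
records = list(read_fastq(fastq_lines))
```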

Unaligned Sequences

FastQ - Header

Header for each read can contain additional information
- HS2000-887_89 - Machine name.
- 5 - Flowcell lane.
- /1 - Read 1 or 2 of pair (here read 1)

Unaligned Sequences

FastQ - Qualities

Qualities follow “+” line.
Phred scale: -10 × log10 of the probability that the base call is wrong.
Encoded in ASCII to save space.
Used in quality assessment and downstream analysis.
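As a worked example of the encoding, the sketch below turns a quality string back into Phred scores and error probabilities. It assumes the Phred+33 offset used by Sanger and recent Illumina output; older Illumina files used an offset of 64. The function names are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert each ASCII character to its Phred score (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-q / 10)


scores = phred_scores("II5#")          # 'I' is Q40, '5' is Q20, '#' is Q2
probs = [error_probability(q) for q in scores]
```

A Q20 base therefore has a 1 in 100 chance of being wrong, and a Q40 base a 1 in 10,000 chance.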

Aligned sequences

SAM format

SAM - Sequence Alignment Map.
Standard format for aligned sequence data.
Recognised by majority of software and browsers.

Aligned sequences

SAM header

SAM header contains information on alignment and contigs used.
- @HD - Version number and sorting information
- @SQ - Contig/Chromosome name and length of sequence.
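A minimal sketch of reading these two header record types follows. The function name `parse_sam_header` is hypothetical and the header lines are illustrative; other record types such as @RG, @PG and @CO are skipped here for brevity.

```python
from typing import Dict, Iterable


def parse_sam_header(lines: Iterable[str]) -> Dict[str, dict]:
    """Extract @HD tags and @SQ contig lengths from SAM header lines."""
    header: Dict[str, dict] = {"HD": {}, "SQ": {}}
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; alignment records follow
        fields = line.rstrip("\n").split("\t")
        record_type = fields[0][1:]
        if record_type not in ("HD", "SQ"):
            continue  # other record types are ignored in this sketch
        tags = dict(field.split(":", 1) for field in fields[1:])
        if record_type == "HD":
            header["HD"] = tags                          # VN = version, SO = sort order
        else:
            header["SQ"][tags["SN"]] = int(tags["LN"])   # contig name -> length
    return header


# Illustrative header lines.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chrM\tLN:16569",
]
header = parse_sam_header(sam_lines)
```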

Aligned sequences

SAM - Aligned reads

Contains read and alignment information and location

Aligned sequences

SAM - Aligned reads

Read name.
Sequence of read.
Encoded sequence quality.

Aligned sequences

SAM - Aligned reads

Chromosome to which read aligns.
Leftmost position in chromosome to which the read aligns (the 5’ end for forward-strand reads).
Alignment information - “Cigar string”.
- 100M - Continuous match of 100 bases
- 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match
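The two CIGAR strings above can be decoded with a short sketch like the one below. The helper names are made up for illustration. The operation letters follow the SAM specification, where M, D, N, = and X consume reference bases and M, I, S, = and X consume read bases.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR string: " + cigar)
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


ops = parse_cigar("28M1D72M")
span = reference_span("28M1D72M")   # deletion adds one reference base
length = read_length("28M1D72M")    # the read itself is still 100 bases
```

Note that M means an aligned position, which may be a sequence match or a mismatch.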

Aligned sequences

SAM - Aligned reads

Bit flag - TRUE/FALSE for pre-defined read criteria
- Paired? Duplicate?
- https://broadinstitute.github.io/picard/explain-flags.html
Paired read position and insert size
User defined flags.
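The bit flag is a single integer in which each bit answers one TRUE/FALSE question. A sketch of decoding it, using the bit values from the SAM specification and illustrative names for each bit:

```python
from typing import List

# Bit values from the SAM specification; the short names are illustrative.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


flag_99 = decode_flag(99)          # 99 = 1 + 2 + 32 + 64
is_duplicate = bool(1024 & 0x400)  # test a single bit directly
```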

Summarised Genomic Features formats.

Summarised Genomic Features formats

Post alignment, sequence reads are typically summarised into scores over/within genomic intervals.
- BED - Genomic intervals and information.
- Wiggle/BedGraph - Genomic intervals and scores.
- GFF - Genomic annotation with information and scores.

Summarising in genomic intervals

BED format (BED3)

Simple format.
3 tab separated columns.
Chromosome, start, end.

Summarising in genomic intervals

BED format (BED6)

Chromosome, start, end.
Identifier.
Score.
Strand (“.” for strandless).
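A BED6 line can be read into a small record as sketched below. The class and function names are hypothetical and the interval is made up. One detail worth keeping in mind is that BED start positions are 0-based and end positions are exclusive, so the interval length is simply end minus start.

```python
from typing import NamedTuple


class Bed6(NamedTuple):
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive
    name: str
    score: float
    strand: str   # "+", "-" or "." for strandless


def parse_bed6(line: str) -> Bed6:
    """Parse the first six tab-separated columns of a BED line."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return Bed6(chrom, int(start), int(end), name, float(score), strand)


# Made-up interval for illustration.
interval = parse_bed6("chr1\t100\t250\tpeak_1\t37\t+")
width = interval.end - interval.start
```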

Summarising in genomic intervals

narrowPeak and broadPeak

narrowPeak and broadPeak are extensions to BED6 used in ENCODE’s peak calling.
Contains p-values, q-values.
narrowPeak - BED6+4.
broadPeak - BED6+3.
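The four extra narrowPeak columns are, in order, signal value, p-value, q-value and the offset of the peak summit from the interval start; broadPeak has the first three only. A sketch of reading one narrowPeak line, with a hypothetical function name and made-up values:

```python
from typing import Dict, Union


def parse_narrowpeak(line: str) -> Dict[str, Union[str, int, float]]:
    """Parse a 10-column narrowPeak line (BED6 plus four extra columns)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError("narrowPeak lines have 10 columns")
    return {
        "chrom": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "name": f[3],
        "score": int(f[4]),
        "strand": f[5],
        "signal_value": float(f[6]),
        "p_value": float(f[7]),   # stored as -log10(p); -1 if not assigned
        "q_value": float(f[8]),   # stored as -log10(q); -1 if not assigned
        "peak": int(f[9]),        # summit offset from start; -1 if not called
    }


# Made-up peak for illustration.
peak = parse_narrowpeak("chr1\t1000\t1400\tpeak_1\t850\t.\t12.5\t8.2\t6.1\t180")
summit = peak["start"] + peak["peak"]
```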

Signal at genomic positions

It is common practice to review signal over the genome.
Special formats exist for this:
- Wiggle
- bedGraph

Signal at genomic positions

Wiggle

Information line.
- Chromosome.
- Step size.
Step start position.
Score.
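The information line, start position, step size and scores described here correspond to the fixedStep flavour of Wiggle. The sketch below expands such a block into explicit intervals. It assumes fixedStep only (variableStep is not handled), and the function name and values are hypothetical. Wiggle positions are 1-based, so they are shifted to 0-based, end-exclusive coordinates to match BED and bedGraph.

```python
from typing import Iterable, List, Tuple


def fixedstep_to_intervals(lines: Iterable[str]) -> List[Tuple[str, int, int, float]]:
    """Expand fixedStep Wiggle blocks into (chrom, start, end, score) tuples."""
    intervals = []
    chrom, pos, step, span = None, 0, 1, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            # Information line: key=value settings for the block that follows.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            pos = int(params["start"])          # 1-based position of first score
            step = int(params["step"])
            span = int(params.get("span", 1))   # bases covered by each score
        elif chrom is not None:
            # 1-based [pos, pos + span - 1] becomes 0-based [pos - 1, pos - 1 + span).
            intervals.append((chrom, pos - 1, pos - 1 + span, float(line)))
            pos += step
    return intervals


# Made-up block for illustration.
wig_lines = ["fixedStep chrom=chr1 start=1001 step=100 span=100", "1.5", "2.0", "0.5"]
signal = fixedstep_to_intervals(wig_lines)
```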

Signal at genomic positions

bedGraph

BED3 format
- Chromosome
- Start
- End
4th column - Score

Genomic Annotation.

Genomic Annotation

GFF: General Feature Format

Used for genome annotation.
Stores position, feature (exon) and meta-feature (transcript/gene) information.

Genomic Annotation

Chromosome.
Start of feature.
End of Feature.
Strand.

Genomic Annotation

Source.
Feature type.
Score.

Genomic Annotation

Column 9 contains key=value pairs (ID=exon01), separated by semi-colons “;”.
ID - Feature name.
Parent - Meta-feature name.
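Putting the nine columns together, a GFF3 line can be parsed as sketched below. The function name and the example feature are hypothetical. GTF files use a different column 9 syntax (`key "value";`), which this sketch does not handle.

```python
from typing import Dict, Union


def parse_gff3_line(line: str) -> Dict[str, Union[str, int, Dict[str, str]]]:
    """Parse one nine-column GFF3 feature line, including column 9 attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("GFF lines have 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes: Dict[str, str] = {}
    for pair in attrs.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key.strip()] = value.strip()
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),   # GFF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,        # "." when no score is given
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# Made-up exon for illustration.
feature = parse_gff3_line(
    "chr1\texample\texon\t1300\t1500\t.\t+\t.\tID=exon01;Parent=transcript01"
)
```

The `Parent` attribute is what links an exon to its transcript and a transcript to its gene.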

Genomic Variants

Variant Call Format (VCF).
Mutation Annotation Format (MAF).

Variant Call Format

Variant Call Format (VCF) is a text file format, usually stored compressed. It contains:

Meta-information lines.
A header line.
Data lines, each containing information about a position in the genome.

The format also has the ability to contain genotype information on samples for each position.
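The three kinds of line can be told apart by their prefix: "##" for meta-information, "#CHROM" for the header, and no prefix for data. A minimal sketch, with a hypothetical function name, made-up sample names and a made-up variant:

```python
from typing import Iterable, List, Tuple


def read_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[dict]]:
    """Split VCF text into meta lines, sample names and parsed data records."""
    meta: List[str] = []
    samples: List[str] = []
    records: List[dict] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])                  # meta-information line
        elif line.startswith("#CHROM"):
            samples = line[1:].split("\t")[9:]     # sample names follow FORMAT
        elif line:
            f = line.split("\t")
            record = {
                "chrom": f[0], "pos": int(f[1]), "id": f[2],
                "ref": f[3], "alt": f[4].split(","),
                "qual": f[5], "filter": f[6], "info": f[7],
            }
            if samples:
                # Column 9 names the per-sample fields; one column per sample follows.
                keys = f[8].split(":")
                record["genotypes"] = {
                    s: dict(zip(keys, v.split(":"))) for s, v in zip(samples, f[9:])
                }
            records.append(record)
    return meta, samples, records


# Made-up two-sample example.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30\tGT:DP\t0/1:14\t1/2:16",
]
meta, samples, records = read_vcf(vcf_lines)
```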

VCF Structure

Genomic Files for computing

bigWig, bigBED and TABIX

Many programs and browsers deal better with compressed, indexed versions of genomic files
- SAM -> BAM (.bam and index file of .bai)
- Wiggle and bedGraph -> bigWig (.bw/.bigWig)
- BED -> bigBed (.bb)
- BED, VCF and GFF -> (.gz and index file of .tbi)

Experimental assays and metadata

Assays and metadata

The results of many experimental assays are summarised in combinations of tables.

For example, an experimental result could contain:

A table of measurements with columns as samples and rows as genes.
A sample to metadata (e.g. sample group) relationship table.
A gene to gene information table.
Tables containing transformations of the data.

Formats for experimental results.

Historically we may have stored these results in spreadsheets.

Efficient experiment results formats.

With bigger datasets we want a format with:

Standards/cross-platform = Easier sharing of data.
Store and connect different data types = Maintain data and metadata information.
Fast and memory efficient retrieval of data = Handle big data queries.

HDF5 (Hierarchical Data Format)

Developed by The HDF Group.
Format and associated set of software libraries.
Cross-platform support.
Self-describing format.
Fast, memory-efficient I/O and data operations.
Suitable for very large datasets and associated metadata.

Single cell data and Loom format

scRNA/scATAC and other single cell datasets can contain large matrices of per cell measurements, associated per cell and per gene metadata, as well as data transformations.
The Loom file format makes use of and extends the HDF5 format for single cell data.

Based on HDF5, so:

Implemented in multiple languages.
Fast and memory efficient access of scRNA data and metadata.

Getting help and more information

UCSC file formats
- https://genome.ucsc.edu/FAQ/FAQformat.html
IGV file formats
- https://www.broadinstitute.org/igv/FileFormats
Sanger (GFF)
- https://www.sanger.ac.uk/resources/software/gff/spec.html

Reference Genomes and Genomics File Formats

Rockefeller University, Bioinformatics Resource Centre

https://rockefelleruniversity.github.io/Genomic_Data/
