class: inverse, center, middle
Genomics Data
Overview
- Reference genomes and GRC.
- Fasta and FastQ (Unaligned sequences).
- SAM/BAM (Aligned sequences).
- BED (Genomic Intervals).
- GFF/GTF (Gene annotation).
- Wiggle files, BEDgraphs and BigWigs (Genomic scores).
- VCF and MAF (Genomic variations).
- HDF5 and Loom (Experimental measurements and metadata).
class: inverse, center, middle
Reference Genomes
Are we there yet?
- The human genome isn't complete!
- In fact, most model organisms' reference genomes are being regularly updated.
- A reference genome consists of a mixture of known chromosomes and unplaced contigs, together called a Genome Reference Assembly.
- Major revisions to assemblies result in a change of coordinates.
- Coordinates must then be converted between revisions.
- The latest genome assembly for humans is GRCh38.
- Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3.
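The major-revision versus patch distinction above can be sketched in code. This is a minimal illustration, assuming assembly names follow the `GRC<species letter><major>[.p<patch>]` pattern seen in these slides; `parse_assembly` and `same_coordinates` are hypothetical helper names, not part of any GRC tool.

```python
import re

# Assumed naming pattern: "GRC" + species letter + major release + optional ".p<patch>".
ASSEMBLY_RE = re.compile(r"^GRC([a-z])(\d+)(?:\.p(\d+))?$")


def parse_assembly(name):
    """Split a name such as 'GRCh38.p3' into species, major release and patch."""
    match = ASSEMBLY_RE.match(name)
    if match is None:
        raise ValueError(f"not a GRC-style assembly name: {name!r}")
    species, major, patch = match.groups()
    return {
        "species": species,
        "major": int(major),
        "patch": int(patch) if patch else 0,  # no suffix is treated as patch 0
    }


def same_coordinates(name_a, name_b):
    """True when two assemblies differ only by patch level (no conversion needed)."""
    a, b = parse_assembly(name_a), parse_assembly(name_b)
    return (a["species"], a["major"]) == (b["species"], b["major"])


print(same_coordinates("GRCh38", "GRCh38.p3"))      # patch only
print(same_coordinates("GRCh37.p13", "GRCh38.p3"))  # major revision
```

Under this sketch, two names that share a species letter and major release share chromosome coordinates, while a change of major release signals that coordinates need converting.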
Genome Reference Consortium
- The GRC is a collaboration of institutes that curate and maintain the reference genomes for 3 model organisms.
- Human - GRCh38.p3
- Mouse - GRCm38.p3
- Zebrafish - GRCz10
- Other model organisms are maintained separately.
- Drosophila - Berkeley Drosophila Genome Project, BDGP36
Why do we need reference genomes?
- Allows for genes and genomic features to be evaluated in their linear genomic context.
- Gene A is close to Gene B
- Gene A and Gene B are within feature C.
- Can be used to align shallow targeted high-throughput sequencing to a pre-built map of an organism's genome.
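The "Gene A is close to Gene B" and "within feature C" relationships listed above reduce to simple comparisons once every feature has coordinates on the same reference. The sketch below assumes 1-based, inclusive coordinates; the `Interval` type, the helper functions and all positions are made up for illustration.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (an assumption of this sketch)
    end: int


def distance(a: Interval, b: Interval) -> Optional[int]:
    """Difference between the nearest ends of a and b; 0 if they overlap.

    Returns None when the features are on different chromosomes.
    """
    if a.chrom != b.chrom:
        return None
    if a.start <= b.end and b.start <= a.end:
        return 0
    return max(a.start, b.start) - min(a.end, b.end)


def within(inner: Interval, outer: Interval) -> bool:
    """True when inner lies entirely inside outer."""
    return (
        inner.chrom == outer.chrom
        and outer.start <= inner.start
        and inner.end <= outer.end
    )


# Invented coordinates, for illustration only.
gene_a = Interval("chr1", 100, 200)
gene_b = Interval("chr1", 300, 400)
feature_c = Interval("chr1", 50, 500)

print(distance(gene_a, gene_b))
print(within(gene_a, feature_c) and within(gene_b, feature_c))
```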
Aligning to a reference genome
A reference genome
- A reference genome is a collection of contigs.
- A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
- Typically comes in FASTA format.
- “>” line contains information on contig
- Lines following contain contig sequence
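The FASTA layout described above (a ">" header line followed by sequence lines) can be read with a few lines of code. This is a minimal sketch, not a full parser; `read_fasta` is a hypothetical helper and the two records are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous contig, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


# Invented example records, for illustration only.
example = """>chr1 example contig
ACGTNNAC
GGTA
>chrUn_0001 unplaced contig
TTGA
""".splitlines()

contigs = dict(read_fasta(example))
for name, sequence in contigs.items():
    print(name, len(sequence))
```

Because the function accepts any iterable of lines, the same code works on an open file handle as well as the in-memory example.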
class: inverse, center, middle
Genomic Annotation.
Genomic Annotation
GFF: General Feature Format
- Used for genome annotation.
- Stores position, feature (exon) and meta-feature (transcript/gene) information.
Genomic Annotation
- Chromosome.
- Start of feature.
- End of feature.
- Strand.
Genomic Annotation
- Source.
- Feature type.
- Score.
Genomic Annotation
- Column 9 contains key-value pairs (ID=exon01), separated by semi-colons “;”.
- ID - Feature name.
- Parent - Meta-feature name.
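The columns described on the previous slides can be pulled apart as follows. This is a minimal sketch assuming a tab-separated, nine-column GFF3 line; `parse_gff_line` is a hypothetical helper and the example record is invented. Column 8 (phase) is not covered in the slides but is part of the nine-column layout.

```python
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff_line(line):
    """Split one GFF3 line into a dict, expanding column 9 into key-value pairs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        pair = pair.strip()
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    record["attributes"] = attributes
    return record


# Invented example record, for illustration only.
line = "chr1\texample\texon\t1300\t1500\t.\t+\t.\tID=exon01;Parent=transcript01"
exon = parse_gff_line(line)
print(exon["seqid"], exon["start"], exon["end"], exon["strand"])
print(exon["attributes"]["ID"], "belongs to", exon["attributes"]["Parent"])
```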
Genomic Variants
- Variant Call Format (VCF).
- Mutation Annotation Format (MAF).
MAF Structure
class: inverse, center, middle
Genomic Files for computing
bigWig, bigBED and TABIX
- Many programs and browsers deal better with compressed, indexed versions of genomic files.
- SAM -> BAM (.bam and index file of .bai)
- Wiggle and bedGraph -> bigWig (.bw/.bigWig)
- BED -> bigBed (.bb)
- BED, VCF and GFF -> bgzip-compressed and tabix-indexed (.gz and index file of .tbi)
class: inverse, center, middle