class: center, middle, inverse, title-slide # Reference Genomes and Genomics File Formats ### Rockefeller University, Bioinformatics Resource Centre ###
https://rockefelleruniversity.github.io/Genomic_Data/
--- class: inverse, center, middle # Genomics Data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Overview - Reference genomes and GRC. - Fasta and FastQ (Unaligned sequences). - SAM/BAM (Aligned sequences). - BED (Genomic Intervals). - GFF/GTF (Gene annotation). - Wiggle files, BEDgraphs and BigWigs (Genomic scores). - VCF and MAF (Genomic variations). - HDF5 and Loom (Experimental measurements and metadata) --- class: inverse, center, middle # Reference Genomes <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Are we there yet? - The human genome isnt complete! - In fact, most model organisms's reference genomes are being regularly updated. - Reference genomes consist of mixture of known chromosomes and unplaced contigs called a **Genome Reference Assembly**. - Major revisions to assembies result in change of co-ordinates. - Requires conversion between revisions. - The latest genome assembly for humans is GRCh38. - Patches add information to the assembly without disrupting the chromosome coordinates i.e GRCh38.p3 --- ## Genome Reference Consortium - GRC is collaboration of institutes which curate and maintain the reference genomes for 3 model organisms. - Human - GRCh38.p3 - Mouse - GRCm38.p3 - Zebrafish - GRCz10 - Other model organisms are maintained separately. - Drosophila - Berkeley Drosophila Genome Project, BDGP36 --- ## Why do we need reference genomes? - Allows for genes and genomic features to be evaluated in their linear genomic context. - Gene A is close to Gene B - Gene A and Gene B are within feature C. - Can be used to align shallow targeted high-thoughput sequencing to a pre-built map of an organisms genome. --- ## Aligning to a reference genomes ![](imgs/alignToRef.png) --- ## A reference genome - A reference genome is a collection of contigs. - A contig is a stretch of DNA sequence encoded as A,G,C,T,N. - Typically comes in FASTA format. - ">" line contains information on contig - Lines following contain contig sequence <div align="center"> <img src="imgs/fasta.png" alt="igv" height="300" width="600"> </div> --- class: inverse, center, middle # High-throughput Sequencing formats <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## High-throughput Sequencing formats - Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence files. - FASTQ - Unaligned sequences - SAM - Aligned sequences --- ## Unaligned Sequences **FastQ (FASTA with Qualities)** <div align="center"> <img src="imgs/fq1.png" alt="igv" height="200" width="800"> </div> - "@" followed by identifier. - Sequence information. - "+" - Quality scores encodes as ASCI. --- ## Unaligned Sequences **FastQ - Header** <div align="center"> <img src="imgs/fq2.png" alt="igv" height="200" width="800"> </div> - Header for each read can contain additional information - HS2000-887_89 - Machine name. - 5 - Flowcell lane. - /1 - Read 1 or 2 of pair (here read 1) --- ## Unaligned Sequences **FastQ - Qualities** <div align="center"> <img src="imgs/fq3.png" alt="igv" height="200" width="800"> </div> - Qualities follow "+" line. - -log10 probability of sequence base being wrong. - Encoded in ASCI to save space. - Used in quality assessment and downstream analysis --- ## Aligned sequences **SAM format** - SAM - Sequence Alignment Map. - Standard format for sequence data - Recognised by majority of software and browsers. --- ## Aligned sequences **SAM header** .pull-left[ <div align="center"> <img src="imgs/sam1.png" alt="igv" height="400" width="400"> </div> ] .pull-right[ - SAM header contains information on alignment and contigs used. - @HD - Version number and sorting information - @SQ - Contig/Chromosome name and length of sequence. ] --- ## Aligned sequences **SAM - Aligned reads** <div align="center"> <img src="imgs/sam2.png" alt="igv" height="200" width="800"> </div> - Contains read and alignment information and location --- ## Aligned sequences **SAM - Aligned reads** <div align="center"> <img src="imgs/sam3.png" alt="igv" height="200" width="800"> </div> - Read name. - Sequence of read. - Encoded sequence quality. --- ## Aligned sequences **SAM - Aligned reads** <div align="center"> <img src="imgs/sam4.png" alt="igv" height="200" width="800"> </div> - Chromosome to which read aligns. - Position in chromosome to which 5' of read aligns. - Alignment information - "Cigar string". - 100M - Continuous match of 100 bases - 28M1D72M - 28 bases continuously match, 1 deletion from reference, 72 base match --- ## Aligned sequences **SAM - Aligned reads** <div align="center"> <img src="imgs/sam5.png" alt="igv" height="200" width="800"> </div> - Bit flag - TRUE/FALSE for pre-defined read criteria - Paired? Duplicate? - https://broadinstitute.github.io/picard/explain-flags.html - Paired read position and insert size - User defined flags. --- class: inverse, center, middle # Summarised Genomic Features formats. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Summarised Genomic Features formats - Post alignment, sequences reads are typically summarised into scores over/within genomic intervals. - BED - Genomic intervals and information. - Wiggle/BedGraph - Genomic intervals and scores. - GFF - Genomic annotation with information and scores. --- ## Summarising in genomic intervals ** BED format (BED) ** <div align="center"> <img src="imgs/BED3.png" alt="igv" height="200" width="400"> </div> - Simple format. - 3 tab separated columns. - Chromsome, start, end. --- ## Summarising in genomic intervals ** BED format (BED6) ** <div align="center"> <img src="imgs/bed6.png" alt="igv" height="200" width="400"> </div> - Chromosome, start, end. - Identifier. - Score. - Strand ("." for strandless). --- ## Summarising in genomic intervals ** narrowPeak and broadPeak** - narrowPeak and broadPeak are extensions to BED6 used in Encode's peak calling. - Contains p-values, q-values. - narrowPeak - BED 6+4. - broadPeak - BED6+3. --- ## Signal at genomic positions - Common practice to review signal over genome. - Special formats exist for this - Wiggle - bedGraph --- ## Signal at genomic positions .pull-left[ <div align="center"> <img src="imgs/wiggle.png" alt="igv" height="400" width="400"> </div> ] .pull-right[ - Information line. - Chromosome. - Step size. - Step start position. - Score. ] --- ## Signal at genomic positions **bedGraph** .pull-left[ <div align="center"> <img src="imgs/bedgraph3.png" alt="igv" height="250" width="400"> </div> ] .pull-right[ - BED 3 format - Chromosome - Start - End 4th column - Score ] --- class: inverse, center, middle # Genomic Annotation. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Genomic Annotation **GFF: General Feature Format** <div align="center"> <img src="imgs/gff.png" alt="igv" height="150" width="700"> </div> - Used to genome annotation. - Stores position, feature (exon) and meta-feature (transcript/gene) information. --- ## Genomic Annotation <div align="center"> <img src="imgs/gff2.png" alt="igv" height="150" width="700"> </div> - Chromosome. - Start of feature. - End of Feature. - Strand. --- ## Genomic Annotation <div align="center"> <img src="imgs/gff3.png" alt="igv" height="150" width="700"> </div> - Source. - Feature type. - Score. --- ## Genomic Annotation <div align="center"> <img src="imgs/gff4.png" alt="igv" height="150" width="700"> </div> - Column 9 contains key pairs (ID=exon01), separated by semi-colons ";". - ID - Feature name. - PARENT- Meta-feature name. --- ## Genomic Variants - Variant Call Format (VCF). - Mutation Annotation Format (MAF). --- ## Variant Call Format Variant Call Format ([VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf)) is a text file format (most likely stored in a compressed manner). It contains - meta-information lines. - a header line. - data lines each containing information about a position in the genome . ***The format also has the ability to contain genotype information on samples for each position***. --- ## VCF Structure <div align="center"> <img src="imgs/Picture1.png" alt="datasetSource" height="460" width="900"> </div> --- ## Mutation Annotation Format (MAF) Mutation Annotation Format ([MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/)) is a tab-delimited text file with aggregated **mutation information** from VCF files. --- ## MAF Structure <div align="center"> <img src="imgs/Picture2.png" alt="datasetSource" height="300" width="800"> </div> --- class: inverse, center, middle # Genomic Files for computing <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## bigWig, bigBED and TABIX - Many programs and browsers deal better with compressed, indexed versions of genomic files - SAM -> BAM (.bam and index file of .bai) - Wiggle and bedGraph -> bigWig (.bw/.bigWig) - BED -> bigBed (.bb) - BED, VCF and GFF -> (.gz and index file of .tbi) --- class: inverse, center, middle # Experimental assays and metadata <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Assays and metadata The results of many experimental assays are summarised in combinations of tqbles. For example an experimental result could contain. - A table of measurements with columns as samples and rows as genes. - A sample to metadata (i.e. sample group) relationship table. - A gene to gene information table. - Tables containing transformations of the data. --- ## Formats for experimental results. Historically we may have stored these results in Spreadsheets. <div align="center"> <img src="imgs/ssComb.png" alt="igv" height="500" width="900"> </div> --- ## Efficient experiment results formats. With bigger datasets we want a format with. - Standards/cross-platform = Easier sharing of data. - Store and connect different data types = Maintain data and metadata information. - Fast and memory efficient retrieval of data = Handle big data queries. --- ## HDF5 (hierarchical Data Format) .pull-left[ - Developed by the HDF5 group. - Format and associated set of software libraries. - Cross-platform support. - Self-described format - Fast memory efficent I/O and data operations. - Suitable for very large datasets and associated metadata. ] .pull-right[ <div align="center"> <img src="imgs/HDF5.png" alt="igv" height="500" width="500"> </div> ] --- ## Single cell data and Loom format - scRNA/scATAC and other single cell datasets can contain large matrices of per cell measurements, associated per cell and per gene metadata as well data transformations. - The Loom file format makes use of and extends the HDF5 format for single cell data. <div align="center"> <img src="imgs/LOOM.png" alt="igv" height="150" width="600"> </div> Based on HDF5 so- - Implemented in multiple languages. - Fast and memory efficient access of scRNA data and metadata. --- ## Getting help and more information - UCSC file formats - https://genome.ucsc.edu/FAQ/FAQformat.html - IGV file formats - https://www.broadinstitute.org/igv/FileFormats - Sanger (GFF) - https://www.sanger.ac.uk/resources/software/gff/spec.html