class: center, middle, inverse, title-slide # RU Genome Assembly, Session 1
### Rockefeller University, The Vertebrate Genome Lab ###
http://rockefelleruniversity.github.io/GenomeAssembly/
--- # An introduction to genome assembly #### Principles and practicalities of assembling, evaluating and using a reference genome <div style="float:left; margin:10px" width="40%"> Giulio Formenti, Ph.D.<br/> Laboratory of <br/> Neurogenetics of Language<br/> Vertebrate Genome Laboratory<br/> The Rockefeller University<br/> <a href="mailto:gformenti@rockefeller.edu">gformenti@rockefeller.edu</a> </div> <div style="float:left; margin:10px" width="40%"> Oliver Fedrigo, Ph.D.<br/> Vertebrate Genome Laboratory<br/> The Rockefeller University<br/> <a href="mailto:ofedrigo@rockefeller.edu">ofedrigo@rockefeller.edu</a> </div> <br style="clear:left" /> __In collaboration with the Vertebrate Genomes Laboratory__ _March 22nd & 25th - Monday & Thursday_ <img src="imgs/VGL_logo.png" width=30%> <img src="imgs/RU_logo.png" width=15%> --- ## The Vertebrate Genomes Project <img src="imgs/VGP_phases.png" width=35% /> <img src="imgs/VGP_map.png" width=55% /> <center> <a href="https://vertebrategenomesproject.org/">VGP website</a><br/> <img src="imgs/VGP_logo.png" width=30% /> </center> --- ## Many international endeavours <img src="imgs/B10K_logo.png" width=30% /> <img src="imgs/dToL_logo.png" width=30% /> <img src="imgs/GIGA_logo.jpg" width=30% /> <img src="imgs/EBP_logo.jpg" width=30% /> <img src="imgs/ERGA_logo.jpg" width=30% /> <img src="imgs/BAT1K_logo.jpg" width=30% /> <p style="text-align:center"> <img src="imgs/BIOGENOMA_logo.png" width=30% /> </p> --- ## Why sequencing the DNA _«A knowledge of sequences could contribute much <br/> to our understanding of living matter»_ Frederick Sanger, 1980 <img src="imgs/science.png" width=70% /> --- ## <div style="font-size:30px;background-image: url('imgs/bg.png'); background-repeat:no-repeat; background-size: 796px; text-align:center; height=600px; padding:200px 0 400px 0" width=100%><b>Genome assembly:<br/>A little bit of history</b></div> --- ## <img src="imgs/gatc_history.png" /> To learn more: <a href="https://www.sciencedirect.com/science/article/pii/S2001037019303277">Giani et al., 2020</a> --- ## 1953: The sequence of insulin <img src="imgs/insulin_a.png" width=500px /> <img src="imgs/Sanger.png" width=197px /> _Refined partition chromatography:_ The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence. To learn more: <a href="https://www.sciencedirect.com/science/article/pii/S2001037019303277">Giani et al., 2020</a> --- ## 1953: The sequence of insulin <img src="imgs/insulin_b.png" width=500px /> <img src="imgs/Sanger.png" width=197px /> _Refined partition chromatography:_ The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence. To learn more: <a href="https://www.sciencedirect.com/science/article/pii/S2001037019303277">Giani et al., 2020</a> --- ## 1953: The sequence of insulin <img src="imgs/insulin_c.png" width=500px /> <img src="imgs/Sanger.png" width=197px /> _Refined partition chromatography:_ The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence. To learn more: <a href="https://www.sciencedirect.com/science/article/pii/S2001037019303277">Giani et al., 2020</a> --- ## 1979: Shotgun sequencing Rodger Staden, invents of the first DNA sequencing ‘software’. In 1982, Sanger uses it to assemble the entire 48,502 bp of bacteriophage Lambda genome. <img src="imgs/overlap.png" width=500px /> <img src="imgs/staden.png" width=197px /> To learn more: <a href="https://www.sciencedirect.com/science/article/pii/S2001037019303277">Giani et al., 2020</a> --- ## Whole Genome Sequencing <img src="imgs/timeline_half.png" width=400px /> --- ## De novo genome assembly The process of determining the sequence of an organism without existing reference. <img src="imgs/humanWGA.jpg" width=800px /> --- ## The HGP produced many tech advancements <img src="imgs/Moore_law.png" width=600px /> --- ## Whole Genome Sequencing <img src="imgs/timeline.png" width=800px /> --- ## Genbank <img src="imgs/Genbank1.png" width=800px /> --- ## Genbank <img src="imgs/Genbank2.png" width=800px /> --- ## NGS-based WGS <div style="text-align:center"> <div style="float:left"> <img src="imgs/science_birds1.png" height=250px /> <br/> <img src="imgs/science_birds2.png" height=250px /> </div> <img style="float:left" src="imgs/bird_tree.png" height=500px /> <br style="clear:left" /> </div> --- ## <div style="font-size:30px;background-image: url('imgs/bg.png'); background-repeat:no-repeat; background-size: 796px; text-align:center; height=600px; padding:200px 0 400px 0" width=100%><b>Genome assembly:<br/>Technologies</b></div> --- ## Repeats are a problem <img src="imgs/repeats.png" width=800px /> --- ## Third generation sequencing <div style="float:left; margin:10px"> <img src="imgs/NGS.png" width=300px /><br/> <b>NGS</b><br/> - Bridge amplification<br/> - Short reads<br/> - High throughput<br/> - High quality<br/> </div> <div style="float:left; margin:10px"> <img src="imgs/TGS.png" width=300px /><br/> <b>Pacbio (TGS)</b><br/> - Single molecule<br/> - Long reads<br/> - Lower throughput<br/> - Lower quality<br/> </div> <br style="clear:left" /> <a href="https://www.elsevier.es/es-revista-revista-mexicana-biodiversidad-91-articulo-the-study-biodiversity-in-era-S1870345314730076">Escalante et al., 2014</a> --- ## Third generation sequencing: Pacbio <img src="imgs/pacbio1.png" width=450px /> <img src="imgs/pacbio2.jpg" width=200px /> --- ## Third generation sequencing: Pacbio <img src="imgs/pacbio3.png" width=800px /> --- ## Third generation sequencing: Pacbio <img src="imgs/pacbio4.png" width=800px /> --- ## Long read matter High-quality error-free genome assemblies and annotations are necessary as current 1st and 2nd generation genome sequencing approaches generate numerous errors that cause a variety of problems in downstream analyses. <b>Parts of genes are missing, and some are incorrectly assembled, while others are completely missing from the assemblies despite pieces found in the raw sequence reads.</b> (Vertebrate Genomes Project) <img src="imgs/finch.png" width=800px /> <a href="https://academic.oup.com/gigascience/article/6/10/gix085/4096262">Korlach et al., 2017</a> --- ## Third generation sequencing: nanopore <img src="imgs/nanopore2.png" width=340px /> <img src="imgs/nanopore3.png" width=380px /> --- ## Linked reads <img src="imgs/10x.png" width=600px /> <img src="imgs/10x_logo.jpg" width=100px /> * DNA & RNA (single-cell) * Formerly 10x Genomics, now <a href="https://www.universalsequencing.com/technology">Universal Sequencing (TELL-Seq)</a> --- ## Pacbio Hifi <img src="imgs/pacbio_logo.png" width=200px /> <img src="imgs/ccs.png" width=100% /> --- ## The Telomere-to-Telomere Consortium (T2T) An open, community-based effort to generate the first complete assembly of a human genome. <img src="imgs/logsdon.png" width=100% /> <a href="https://www.biorxiv.org/content/10.1101/2020.09.08.285395v1">Logsdon et al., preprint</a> --- ## The Telomere-to-Telomere Consortium (T2T) <div style="text-align:center"> <img src="imgs/miga1.png" width=75% /> <img src="imgs/miga2.png" width=65% /> </div> <a href="https://www.nature.com/articles/s41586-020-2547-7">Miga et al., 2020</a> --- ## The Telomere-to-Telomere Consortium (T2T) Hifi reads are nearly perfect in homopolymer-compressed space. AATTCTACTCATAT__AAAAA__TCA__TTTTTT__CA → AATTCTACTCATAT__A__TCA__T__CA <img style="margin:10px" src="imgs/NGS.png" width=45% /> <img style="margin:10px" src="imgs/TGS.png" width=45% /> <a href="https://www.elsevier.es/es-revista-revista-mexicana-biodiversidad-91-articulo-the-study-biodiversity-in-era-S1870345314730076">Escalante et al., 2014</a> --- ## The Telomere-to-Telomere Consortium (T2T) <img src="imgs/nurk.png" width=100% /> Nurk et al., in preparation --- ## From reads to contigs <img src="imgs/reads2contigs.png" width=100% /> <div style="text-align:center"><img src="imgs/swallow.png" width=90% /></div> --- ## From tigs to scaffolds <img src="imgs/tigs2scaffolds.png" width=100% /> --- ## From scaffolds to chromosomes <img src="imgs/scaffolds2chr.png" width=100% /> --- ## Scaffolding: Optical maps <img style="position:absolute; position: 0 0" src="imgs/bionano1.png" width=20% /> <br/><br/><br/> <img src="imgs/bionano2.jpg" width=100% /> --- ## Scaffolding: Optical maps <img src="imgs/bionano3.jpg" width=100% /> <img src="imgs/bionano4.png" width=100% /> --- ## Scaffolding: Hic Long-range interactions. Used also to reconstruct the 3D structure of DNA. <img src="imgs/hic1.png" width=45% /> <img src="imgs/hic2.png" width=45% /> --- ## High molecular DNA required <img src="imgs/PFGE_FEMTO.png" width=100% /> <a href="https://academic.oup.com/gigascience/article/8/1/giy142/5202456">Formenti et al., 2019</a> --- ## VGP pipeline <img src="imgs/VGP_pipeline.png" width=100% /> <a href="https://www.biorxiv.org/content/10.1101/2020.05.22.110833v1">Rhie et al., in press</a> --- ## <div style="font-size:30px;background-image: url('imgs/bg.png'); background-repeat:no-repeat; background-size: 796px; text-align:center; height=600px; padding:200px 0 400px 0" width=100%><b>Genome assembly:<br/>Applications</b></div> --- ## Applications <div style="text-align:center"> <img src="imgs/applications_long_reads.png" width=100% /> </div> --- ## Contiguity matters <div style="text-align:center"> <img src="imgs/contiguity.png" width=80% /> </div> --- ## Contiguity matters Differences between two humans: <img src="imgs/human_SVs.png" width=100% /> Human-chimp differences (120 Mb overall): 3x10^7 substitutions → 30 Mb (25%) 5x10^6 indels (<80 bp) → 22 Mb (18%) 7x10^4 SVs (>80 bp) → 68 Mb (57%) --- ## Whole genome alignments <img src="imgs/chicken_barn_swallow.png" width=100% /> --- ## Whole genome alignments <div style="text-align:center"> <img src="imgs/chicken_zebra_finch.png" width=50% /> </div> --- ## Supergenes in the ruff <img src="imgs/ruff.png" width=100% /> <div style="text-align:center"> <a href="https://www.nature.com/articles/ng.3443">Küpper et al., 2015</a> <br /><br /><br /> <a href="https://www.scilifelab.se/news/a-supergene-underlies-genetic-differences-in-testosteron-levels-and-sexual-behaviour-in-male-ruff/">Watch this video!</a> </div> --- ## <div style="font-size:30px;background-image: url('imgs/bg.png'); background-repeat:no-repeat; background-size: 796px; text-align:center; height=600px; padding:200px 0 400px 0" width=100%><b>Genome assembly:<br/>Principles and practicalities</b></div> --- ## Everything starts from an alignment - Brute-force - Global alignments - Local alignments - Mixed global/local alignments - Alignment-free approaches --- ## Brute-force <ol> <li>Generate all possible alignments</li> <li>Calculate quality score for each alignment</li> <li>Pick the alignment with the highest quality score</li> <li>Done!</li> </ol> Unfortunately, there are: <img src="imgs/brute_force.png" width=10% /> potential alignments between 2 sequences of length N That is, with sequence length = 100: 2^(2\*200)/(3.14\*100)^(1/2) = 9.068476 × 10^58 alignments --- ## Global alignments <b>1970: Needleman–Wunsch algorithm</b> - Still guarantees that the best solution will be found - Also referred as optimal matching algorithm - Divides a large problem (that of aligning long sequences between each other) into smaller problems (a concept referred to as Dynamic programming) - Maximizes similarity - Goes from the smaller problems to the general problem solution - Used for ‘optimal’ global alignment - Generally for highly similar sequences (in sequence and length) - Very slow as the length of sequences grows, but of utmost precision --- ## <div style="float:left"> <br/><br/><br/><br/> Sequences<br/> ---------<br/> GCATGCU<br/> GATTACA<br/> <br/> Best alignments<br/> ---------------<br/> GCATG-CU<br/> G-ATTACA<br/> <br/> GCA-TGCU<br/> G-ATTACA<br/> <br/> GCAT-GCU<br/> G-ATTACA<br/> </div> <img style="float:left" src="imgs/NW_algorithm.png" width=70% /> <br style="clear:left" /> <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Wikipedia</a> --- ## The edit distance - Also referred as Levenshtein distance banana → bamana (substitution of "n" for "m") bamana → bambna (substitution of "a" for "b") bambna → bambina (insertion of "i"). EDIT DISTANCE = 3 You can calculate the edit distance for all possible alignments and choose the alignment that minimizes the edit distance (for longer sequences find an heuristic). It has been shown it is mathematically equivalent to optimal matching. --- ## Local alignments <b>1981: Smith–Waterman algorithm</b> - Compares segments of all possible lengths and optimizes the similarity measure - The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, making positively scoring local alignments visible. Traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment - It finds the optimal “local” alignment (best local solution) - For slightly divergent sequences - Quadratic complexity in time and space, therefore it often cannot be practically applied to large-scale problems → you need linear solutions --- ## Alignment-free sequence comparisons Alignment-free approaches have been used for: - Transcript quantification → 10-100 times faster. - SNP calling for known variants (genotyping). - _De novo_ genome assembly. - Metagenomics. - Identification of highly divergent sequences. - Detection of functional similarities between regulatory elements (e.g. promoters, enhancers, and silencers). - Study of horizontal gene transfer. - DNA barcoding of species. --- ## Euclidean distance <div style="text-align:center"> <img src="imgs/Euclidean_distance.png" width=70% /> <br/> <a href="https://link.springer.com/article/10.1186/s13059-017-1319-7">Zielezinski et al. 2017</a> </div> --- ## Assembling a genome today <div style="float:left; margin:20px"> <b>Short read assemblers:</b><br/> <br/> - SGA <i>String graph</i><br/> - ValVel <i>String graph</i><br/> - DISCOVAR <i>DBG</i><br/> - SOAPdenovo <i>DBG</i><br/> - Euler <i>DBG</i><br/> - ABySS <i>DBG</i><br/> - Velvet <i>DBG</i><br/> - SPAdes <i>DBG</i><br/> - Edena <i>OLC</i><br/> - Ray <i>Hybrid</i><br/> - SSAKE <i>Greedy</i><br/> - Perga <i>Greedy</i><br/> - ...<br/> </div> <div style="float:left; margin:20px"> <b>Long read assemblers:</b><br/> <br/> - Hifiasm <i>OLC</i><br/> - Canu/HiCanu (ex Celera) <i>OLC</i><br/> - Peregrine <i>HGAP/OLC</i><br/> - Falcon-Unzip <i>HGAP/OLC</i><br/> - Flye <i>Repeat graph</i><br/> - ...<br/> </div> <br style="clear:left" /> <a href="https://en.wikipedia.org/wiki/De_novo_sequence_assemblers">Wikipedia</a> --- ## Assembling a genome today <img src="imgs/OLC_DBG1.png" width=100% /> <a href="https://academic.oup.com/bfg/article/11/1/25/191455">Li et al. 2012</a> --- ## Assembling a genome today <img src="imgs/OLC_DBG2.png" width=100% /> <a href="https://academic.oup.com/bfg/article/11/1/25/191455">Li et al. 2012</a> --- ## Canu <div style="text-align:center"> <img src="imgs/canu.png" width=50% /> </div> <a href="https://genome.cshlp.org/content/early/2017/03/15/gr.215087.116">Koren et al. 2017</a> --- ## Galaxy <img src="imgs/galaxy1.png" width=100% /> --- ## Genome assembly with Galaxy Unicycler <div style="text-align:center"> <img src="imgs/galaxy2.png" width=80% /> </div> --- ##Exercises Time for exercises! [Link here](../../exercises/exercises/Galaxy_ex_exercise.html)