An introduction to genome assembly

Principles and practicalities of assembling, evaluating and using a reference genome

Giulio Formenti, Ph.D.
Laboratory of
Neurogenetics of Language
Vertebrate Genome Laboratory
The Rockefeller University
gformenti@rockefeller.edu

Oliver Fedrigo, Ph.D.
Vertebrate Genome Laboratory
The Rockefeller University
ofedrigo@rockefeller.edu

In collaboration with the Vertebrate Genomes Laboratory

March 22nd & 25th - Monday & Thursday

The Vertebrate Genomes Project

VGP website

Many international endeavours

Why sequencing the DNA

«A knowledge of sequences could contribute much
to our understanding of living matter»

Frederick Sanger, 1980

Genome assembly:
A little bit of history

To learn more: Giani et al., 2020

1953: The sequence of insulin

Refined partition chromatography:

The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.

To learn more: Giani et al., 2020

1953: The sequence of insulin

Refined partition chromatography:

The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.

To learn more: Giani et al., 2020

1953: The sequence of insulin

Refined partition chromatography:

The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.

To learn more: Giani et al., 2020

1979: Shotgun sequencing

Rodger Staden, invents of the first DNA sequencing ‘software’.

In 1982, Sanger uses it to assemble the entire 48,502 bp of bacteriophage Lambda genome.

To learn more: Giani et al., 2020

Whole Genome Sequencing

De novo genome assembly

The process of determining the sequence of an organism without existing reference.

The HGP produced many tech advancements

Whole Genome Sequencing

Genbank

NGS-based WGS

Genome assembly:
Technologies

Repeats are a problem

Third generation sequencing

NGS
- Bridge amplification
- Short reads
- High throughput
- High quality

Pacbio (TGS)
- Single molecule
- Long reads
- Lower throughput
- Lower quality

Escalante et al., 2014

Third generation sequencing: Pacbio

Long read matter

High-quality error-free genome assemblies and annotations are necessary as current 1st and 2nd generation genome sequencing approaches generate numerous errors that cause a variety of problems in downstream analyses. Parts of genes are missing, and some are incorrectly assembled, while others are completely missing from the assemblies despite pieces found in the raw sequence reads. (Vertebrate Genomes Project)

Korlach et al., 2017

Third generation sequencing: nanopore

Linked reads

DNA & RNA (single-cell)
Formerly 10x Genomics, now Universal Sequencing (TELL-Seq)

Pacbio Hifi

The Telomere-to-Telomere Consortium (T2T)

An open, community-based effort to generate the first complete assembly of a human genome.

Logsdon et al., preprint

The Telomere-to-Telomere Consortium (T2T)

Miga et al., 2020

The Telomere-to-Telomere Consortium (T2T)

Hifi reads are nearly perfect in homopolymer-compressed space.

AATTCTACTCATAT__AAAAA__TCA__TTTTTT__CA → AATTCTACTCATAT__A__TCA__T__CA

Escalante et al., 2014

The Telomere-to-Telomere Consortium (T2T)

Nurk et al., in preparation

From reads to contigs

From tigs to scaffolds

From scaffolds to chromosomes

Scaffolding: Optical maps

Scaffolding: Hic

Long-range interactions. Used also to reconstruct the 3D structure of DNA.

High molecular DNA required

Formenti et al., 2019

VGP pipeline

Rhie et al., in press

Genome assembly:
Applications

Applications

Contiguity matters

Differences between two humans:

Human-chimp differences (120 Mb overall):

3x10^7 substitutions → 30 Mb (25%)

5x10^6 indels (<80 bp) → 22 Mb (18%)

7x10^4 SVs (>80 bp) → 68 Mb (57%)

Whole genome alignments

Supergenes in the ruff

Küpper et al., 2015

Watch this video!

Genome assembly:
Principles and practicalities

Everything starts from an alignment

Brute-force
Global alignments
Local alignments
Mixed global/local alignments
Alignment-free approaches

Brute-force

Generate all possible alignments
Calculate quality score for each alignment
Pick the alignment with the highest quality score
Done!

Unfortunately, there are: potential alignments between 2 sequences of length N

That is, with sequence length = 100:

2^{(2*200)/(3.14*100)}(1/2) = 9.068476 × 10^58 alignments

Global alignments

1970: Needleman–Wunsch algorithm

Still guarantees that the best solution will be found
Also referred as optimal matching algorithm
Divides a large problem (that of aligning long sequences between each other) into smaller problems (a concept referred to as Dynamic programming)
Maximizes similarity
Goes from the smaller problems to the general problem solution
Used for ‘optimal’ global alignment
Generally for highly similar sequences (in sequence and length)
Very slow as the length of sequences grows, but of utmost precision

Sequences

GCATGCU
GATTACA

Best alignments

GCATG-CU
G-ATTACA

GCA-TGCU
G-ATTACA

GCAT-GCU
G-ATTACA

Wikipedia

The edit distance

Also referred as Levenshtein distance

banana → bamana (substitution of “n” for “m”)

bamana → bambna (substitution of “a” for “b”)

bambna → bambina (insertion of “i”).

EDIT DISTANCE = 3

You can calculate the edit distance for all possible alignments and choose the alignment that minimizes the edit distance (for longer sequences find an heuristic). It has been shown it is mathematically equivalent to optimal matching.

Local alignments

1981: Smith–Waterman algorithm

Compares segments of all possible lengths and optimizes the similarity measure
The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, making positively scoring local alignments visible. Traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment
It finds the optimal “local” alignment (best local solution)
For slightly divergent sequences
Quadratic complexity in time and space, therefore it often cannot be practically applied to large-scale problems → you need linear solutions

Alignment-free sequence comparisons

Alignment-free approaches have been used for:

Transcript quantification → 10-100 times faster.
SNP calling for known variants (genotyping).
De novo genome assembly.
Metagenomics.
Identification of highly divergent sequences.
Detection of functional similarities between regulatory elements (e.g. promoters, enhancers, and silencers).
Study of horizontal gene transfer.
DNA barcoding of species.

Euclidean distance

Zielezinski et al. 2017

Assembling a genome today

Short read assemblers:

- SGA   String graph
- ValVel   String graph
- DISCOVAR   DBG
- SOAPdenovo   DBG
- Euler   DBG
- ABySS   DBG
- Velvet   DBG
- SPAdes   DBG
- Edena   OLC
- Ray   Hybrid
- SSAKE   Greedy
- Perga   Greedy
- …

Long read assemblers:

- Hifiasm   OLC
- Canu/HiCanu (ex Celera)   OLC
- Peregrine   HGAP/OLC
- Falcon-Unzip   HGAP/OLC
- Flye   Repeat graph
- …

Wikipedia

Assembling a genome today

Li et al. 2012

Assembling a genome today

Li et al. 2012

Canu

Koren et al. 2017

Galaxy

Genome assembly with Galaxy Unicycler

##Exercises

Time for exercises! Link here

RU Genome Assembly, Session 1

Rockefeller University, The Vertebrate Genome Lab

http://rockefelleruniversity.github.io/GenomeAssembly/

An introduction to genome assembly

Principles and practicalities of assembling, evaluating and using a reference genome

The Vertebrate Genomes Project

Many international endeavours

Why sequencing the DNA

1953: The sequence of insulin

1953: The sequence of insulin

1953: The sequence of insulin

1979: Shotgun sequencing

Whole Genome Sequencing

De novo genome assembly

The HGP produced many tech advancements

Whole Genome Sequencing

Genbank

Genbank

NGS-based WGS

Repeats are a problem

Third generation sequencing

Third generation sequencing: Pacbio

Third generation sequencing: Pacbio

Third generation sequencing: Pacbio

Long read matter

Third generation sequencing: nanopore

Linked reads

Pacbio Hifi

The Telomere-to-Telomere Consortium (T2T)

The Telomere-to-Telomere Consortium (T2T)

The Telomere-to-Telomere Consortium (T2T)

The Telomere-to-Telomere Consortium (T2T)

From reads to contigs

From tigs to scaffolds

From scaffolds to chromosomes

Scaffolding: Optical maps

Scaffolding: Optical maps

Scaffolding: Hic

High molecular DNA required

VGP pipeline

Applications

Contiguity matters

Contiguity matters

Whole genome alignments

Whole genome alignments

Supergenes in the ruff

Everything starts from an alignment

Brute-force

Global alignments

The edit distance

Local alignments

Alignment-free sequence comparisons

Euclidean distance

Assembling a genome today

Assembling a genome today

Assembling a genome today

Canu

Galaxy

Genome assembly with Galaxy Unicycler