An introduction to genome assembly

Principles and practicalities of assembling, evaluating and using a reference genome

Giulio Formenti, Ph.D.
Laboratory of
Neurogenetics of Language
Vertebrate Genome Laboratory
The Rockefeller University
gformenti@rockefeller.edu

Oliver Fedrigo, Ph.D.
Vertebrate Genome Laboratory
The Rockefeller University
ofedrigo@rockefeller.edu


In collaboration with the Vertebrate Genomes Laboratory

March 22nd & 25th - Monday & Thursday

The Vertebrate Genomes Project

VGP website

Many international endeavours

Why sequencing the DNA

«A knowledge of sequences could contribute much
to our understanding of living matter»

Frederick Sanger, 1980

Genome assembly:
A little bit of history

To learn more: Giani et al., 2020

1953: The sequence of insulin

Refined partition chromatography:

The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.

To learn more: Giani et al., 2020

1953: The sequence of insulin

Refined partition chromatography:

The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.

To learn more: Giani et al., 2020

1953: The sequence of insulin

Refined partition chromatography:

The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.

To learn more: Giani et al., 2020

1979: Shotgun sequencing

Rodger Staden, invents of the first DNA sequencing ‘software’.

In 1982, Sanger uses it to assemble the entire 48,502 bp of bacteriophage Lambda genome.

To learn more: Giani et al., 2020

Whole Genome Sequencing

De novo genome assembly

The process of determining the sequence of an organism without existing reference.

The HGP produced many tech advancements

Whole Genome Sequencing

Genbank

Genbank

NGS-based WGS



Genome assembly:
Technologies

Repeats are a problem

Third generation sequencing


NGS
- Bridge amplification
- Short reads
- High throughput
- High quality


Pacbio (TGS)
- Single molecule
- Long reads
- Lower throughput
- Lower quality


Escalante et al., 2014

Third generation sequencing: Pacbio

Third generation sequencing: Pacbio

Third generation sequencing: Pacbio

Long read matter

High-quality error-free genome assemblies and annotations are necessary as current 1st and 2nd generation genome sequencing approaches generate numerous errors that cause a variety of problems in downstream analyses. Parts of genes are missing, and some are incorrectly assembled, while others are completely missing from the assemblies despite pieces found in the raw sequence reads. (Vertebrate Genomes Project)

Korlach et al., 2017

Third generation sequencing: nanopore

Linked reads

Pacbio Hifi

The Telomere-to-Telomere Consortium (T2T)

An open, community-based effort to generate the first complete assembly of a human genome.

Logsdon et al., preprint

The Telomere-to-Telomere Consortium (T2T)

Miga et al., 2020

The Telomere-to-Telomere Consortium (T2T)

Hifi reads are nearly perfect in homopolymer-compressed space.

AATTCTACTCATAT__AAAAA__TCA__TTTTTT__CA → AATTCTACTCATAT__A__TCA__T__CA

Escalante et al., 2014

The Telomere-to-Telomere Consortium (T2T)

Nurk et al., in preparation

From reads to contigs

From tigs to scaffolds

From scaffolds to chromosomes

Scaffolding: Optical maps




Scaffolding: Optical maps

Scaffolding: Hic

Long-range interactions. Used also to reconstruct the 3D structure of DNA.

High molecular DNA required

Formenti et al., 2019

VGP pipeline

Rhie et al., in press

Genome assembly:
Applications

Applications

Contiguity matters

Contiguity matters

Differences between two humans:

Human-chimp differences (120 Mb overall):

3x10^7 substitutions → 30 Mb (25%)

5x10^6 indels (<80 bp) → 22 Mb (18%)

7x10^4 SVs (>80 bp) → 68 Mb (57%)

Whole genome alignments

Whole genome alignments

Supergenes in the ruff

Genome assembly:
Principles and practicalities

Everything starts from an alignment

  • Brute-force
  • Global alignments
  • Local alignments
  • Mixed global/local alignments
  • Alignment-free approaches

Brute-force

  1. Generate all possible alignments
  2. Calculate quality score for each alignment
  3. Pick the alignment with the highest quality score
  4. Done!

Unfortunately, there are: potential alignments between 2 sequences of length N

That is, with sequence length = 100:

2(2*200)/(3.14*100)(1/2) = 9.068476 × 10^58 alignments

Global alignments

1970: Needleman–Wunsch algorithm

  • Still guarantees that the best solution will be found
  • Also referred as optimal matching algorithm
  • Divides a large problem (that of aligning long sequences between each other) into smaller problems (a concept referred to as Dynamic programming)
  • Maximizes similarity
  • Goes from the smaller problems to the general problem solution
  • Used for ‘optimal’ global alignment
  • Generally for highly similar sequences (in sequence and length)
  • Very slow as the length of sequences grows, but of utmost precision





Sequences

GCATGCU
GATTACA

Best alignments

GCATG-CU
G-ATTACA

GCA-TGCU
G-ATTACA

GCAT-GCU
G-ATTACA


Wikipedia

The edit distance

  • Also referred as Levenshtein distance

banana → bamana (substitution of “n” for “m”)

bamana → bambna (substitution of “a” for “b”)

bambna → bambina (insertion of “i”).

EDIT DISTANCE = 3

You can calculate the edit distance for all possible alignments and choose the alignment that minimizes the edit distance (for longer sequences find an heuristic). It has been shown it is mathematically equivalent to optimal matching.

Local alignments

1981: Smith–Waterman algorithm

  • Compares segments of all possible lengths and optimizes the similarity measure

  • The main difference to the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, making positively scoring local alignments visible. Traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment

  • It finds the optimal “local” alignment (best local solution)

  • For slightly divergent sequences

  • Quadratic complexity in time and space, therefore it often cannot be practically applied to large-scale problems → you need linear solutions

Alignment-free sequence comparisons

Alignment-free approaches have been used for:

  • Transcript quantification → 10-100 times faster.
  • SNP calling for known variants (genotyping).
  • De novo genome assembly.
  • Metagenomics.
  • Identification of highly divergent sequences.
  • Detection of functional similarities between regulatory elements (e.g. promoters, enhancers, and silencers).
  • Study of horizontal gene transfer.
  • DNA barcoding of species.

Euclidean distance

Assembling a genome today

Short read assemblers:

- SGA   String graph
- ValVel   String graph
- DISCOVAR   DBG
- SOAPdenovo   DBG
- Euler   DBG
- ABySS   DBG
- Velvet   DBG
- SPAdes   DBG
- Edena   OLC
- Ray   Hybrid
- SSAKE   Greedy
- Perga   Greedy
- …

Long read assemblers:

- Hifiasm   OLC
- Canu/HiCanu (ex Celera)   OLC
- Peregrine   HGAP/OLC
- Falcon-Unzip   HGAP/OLC
- Flye   Repeat graph
- …


Wikipedia

Assembling a genome today

Li et al. 2012

Assembling a genome today

Li et al. 2012

Galaxy

Genome assembly with Galaxy Unicycler

##Exercises

Time for exercises! Link here