Giulio Formenti, Ph.D.
Laboratory of Neurogenetics of Language
Vertebrate Genome Laboratory
The Rockefeller University
gformenti@rockefeller.edu
Oliver Fedrigo, Ph.D.
Vertebrate Genome Laboratory
The Rockefeller University
ofedrigo@rockefeller.edu
In collaboration with the Vertebrate Genomes Laboratory
March 22nd & 25th - Monday & Thursday
«A knowledge of sequences could contribute much
to our understanding of living matter»
Frederick Sanger, 1980
To learn more: Giani et al., 2020
Refined partition chromatography:
The two chains are separated and fragmented, the fragments are individually read and sequences from each fragment overlapped to yield a complete sequence.
Rodger Staden invents the first DNA sequencing ‘software’.
In 1982, Sanger uses it to assemble the entire 48,502 bp genome of bacteriophage lambda.
To learn more: Giani et al., 2020
The process of determining the genome sequence of an organism without an existing reference.
NGS
- Bridge amplification
- Short reads
- High throughput
- High quality
PacBio (TGS)
- Single molecule
- Long reads
- Lower throughput
- Lower quality
High-quality error-free genome assemblies and annotations are necessary as current 1st and 2nd generation genome sequencing approaches generate numerous errors that cause a variety of problems in downstream analyses. Parts of genes are missing, and some are incorrectly assembled, while others are completely missing from the assemblies despite pieces found in the raw sequence reads. (Vertebrate Genomes Project)
An open, community-based effort to generate the first complete assembly of a human genome.
HiFi reads are nearly perfect in homopolymer-compressed space.
AATTCTACTCATAT__AAAAA__TCA__TTTTTT__CA → AATTCTACTCATAT__A__TCA__T__CA
Nurk et al., in preparation
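Homopolymer compression can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular assembler; the function name `homopolymer_compress` is hypothetical. Note that a full compression collapses every run of identical bases, including the leading AA and TT, whereas the slide example highlights only the two long runs.

```python
from itertools import groupby

def homopolymer_compress(seq):
    """Collapse every run of identical bases to a single base."""
    # groupby yields one (base, run) pair per maximal run of equal characters
    return "".join(base for base, _run in groupby(seq))

read = "AATTCTACTCATATAAAAATCATTTTTTCA"
print(homopolymer_compress(read))
```

Two reads that differ only in the length of a homopolymer run become identical after compression, which is why run-length errors disappear in this space.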
Long-range interactions. Used also to reconstruct the 3D structure of DNA.
Differences between two humans:
Human-chimp differences (120 Mb overall):
3x10^7 substitutions → 30 Mb (25%)
5x10^6 indels (<80 bp) → 22 Mb (18%)
7x10^4 SVs (>80 bp) → 68 Mb (57%)
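The percentages above follow directly from the counts. The short check below (variable names are illustrative) recomputes each class's share of the 120 Mb total and the mean number of bases affected per event, which makes it easy to see that SVs are the rarest events but account for most of the differing bases.

```python
# (number of events, affected megabases) for each class, as listed above
classes = {
    "substitutions": (3 * 10**7, 30),
    "indels (<80 bp)": (5 * 10**6, 22),
    "SVs (>80 bp)": (7 * 10**4, 68),
}

total_mb = sum(mb for _events, mb in classes.values())

for name, (events, mb) in classes.items():
    share = 100 * mb / total_mb          # percent of all differing bases
    mean_bp = mb * 10**6 / events        # average bases affected per event
    print(f"{name}: {share:.0f}% of {total_mb} Mb, ~{mean_bp:.1f} bp per event")
```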
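Unfortunately, there are about $\frac{2^{2N}}{\sqrt{\pi N}}$ potential alignments between 2 sequences of length $N$.
That is, with sequence length $N = 100$ (taking $\pi \approx 3.14$):
$\frac{2^{2 \cdot 100}}{(3.14 \cdot 100)^{1/2}} \approx 9.068476 \times 10^{58}$ alignments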
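To get a feel for how fast this number grows, the estimate 2^(2N) / sqrt(pi N) can be evaluated for a few lengths. The sketch below works in log10 so that long sequences do not overflow floating point; the function name `log10_alignment_count` is hypothetical, and the `pi` parameter exists only so the slide's rounded value of 3.14 can be reproduced.

```python
import math

def log10_alignment_count(n, pi=math.pi):
    """log10 of the estimate 2^(2n) / sqrt(pi * n) for two length-n sequences."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(pi * n)

for n in (10, 100, 1000):
    print(f"N = {n}: about 10^{log10_alignment_count(n):.1f} alignments")

# Reproduce the N = 100 figure with pi rounded to 3.14
print(10 ** log10_alignment_count(100, pi=3.14))
```

Even at a few hundred bases the count is far beyond anything that could be enumerated, which is what motivates the dynamic programming algorithms that follow.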
1970: Needleman–Wunsch algorithm
Sequences
GCATGCU
GATTACA
Best alignments
GCATG-CU
G-ATTACA
GCA-TGCU
G-ATTACA
GCAT-GCU
G-ATTACA
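A minimal sketch of the Needleman–Wunsch recurrence and traceback is shown below. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption, since the slide does not state one; under it, each of the three alignments listed above scores 0. The traceback returns only one optimal alignment, and which of the ties it returns depends on the order in which moves are checked.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Traceback from the bottom-right corner to the origin
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, top, bottom = needleman_wunsch("GCATGCU", "GATTACA")
print(best)
print(top)
print(bottom)
```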
banana → bamana (substitution of “n” with “m”)
bamana → bambna (substitution of “a” with “b”)
bambna → bambina (insertion of “i”).
EDIT DISTANCE = 3
You can calculate the edit distance for all possible alignments and choose the alignment that minimizes it (for longer sequences, find a heuristic). Minimizing the edit distance has been shown to be mathematically equivalent to optimal matching.
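In practice the minimum is found without enumerating alignments. The sketch below uses the standard dynamic programming formulation of Levenshtein distance (unit cost for substitution, insertion and deletion), which is an assumption about the cost model since the slide does not name one. The function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, keeping only two rows."""
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("banana", "bambina"))
```

For the banana and bambina example this gives the same value as the three manual steps above, and it runs in time proportional to the product of the two lengths.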
1981: Smith–Waterman algorithm
Compares segments of all possible lengths and optimizes the similarity measure
The main difference from the Needleman–Wunsch algorithm is that negative scoring matrix cells are set to zero, making positively scoring local alignments visible. The traceback procedure starts at the highest scoring matrix cell and proceeds until a cell with score zero is encountered, yielding the highest scoring local alignment.
It finds the optimal “local” alignment (best local solution)
For slightly divergent sequences
Quadratic complexity in time and space, therefore it often cannot be practically applied to large-scale problems → you need linear solutions
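The two differences from global alignment (clamping cells at zero, and tracing back from the highest scoring cell until a zero is reached) can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions chosen for illustration, not taken from the slides. The full matrix is stored, which is where the quadratic space cost comes from.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # negative cells are set to zero
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback starts at the highest scoring cell and stops at a zero cell
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two unrelated flanks around a shared core: only the core is reported
print(smith_waterman("CCCCGATTACACCCC", "TTTTGATTACATTTT"))
```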
Alignment-free approaches have been used for:
Short read assemblers:
- SGA String graph
- ValVel String graph
- DISCOVAR DBG
- SOAPdenovo DBG
- Euler DBG
- ABySS DBG
- Velvet DBG
- SPAdes DBG
- Edena OLC
- Ray Hybrid
- SSAKE Greedy
- Perga Greedy
- …
Long read assemblers:
- Hifiasm OLC
- Canu/HiCanu (ex Celera) OLC
- Peregrine HGAP/OLC
- Falcon-Unzip HGAP/OLC
- Flye Repeat graph
- …