Genome assembly: Evaluation

Synopsis

  1. Estimation of genome parameters (GenomeScope)
  2. General assembly metrics (contiguity, quality)
  3. Assembly completeness (BUSCO)
  4. Structural accuracy
  5. K-mer multiplicity (false duplications, missing k-mers)
  6. Switch error rates
  7. Whole-genome alignment (D-GENIES, MUMmer, Cactus)

1. Genome parameters

The leading software for assembly-free estimation of genome parameters is GenomeScope (v2.0, 2020).
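
To give a feel for what an assembly-free estimate means, here is a minimal sketch of the classic back-of-the-envelope calculation: genome size is roughly the total number of k-mers divided by the k-mer coverage at the main peak of the k-mer histogram. This is not GenomeScope's method, which fits a statistical model to the whole profile. The function name, the `error_cutoff` parameter and the toy histogram are all hypothetical.

```python
def estimate_genome_size(histogram, error_cutoff=5):
    """Rough genome size estimate from a k-mer histogram.

    histogram: dict mapping k-mer multiplicity -> number of distinct k-mers.
    error_cutoff: multiplicities below this are treated as sequencing
                  errors and ignored (an assumed, illustrative threshold).
    """
    kept = {m: c for m, c in histogram.items() if m >= error_cutoff}
    if not kept:
        raise ValueError("no k-mers left after removing the error tail")
    # Multiplicity of the tallest remaining peak, taken as the k-mer coverage.
    peak = max(kept, key=kept.get)
    # Total k-mer occurrences divided by coverage approximates genome size.
    total_kmers = sum(m * c for m, c in kept.items())
    return total_kmers / peak


# Made-up histogram: an error tail at low multiplicity, one peak at 20x.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
print(estimate_genome_size(toy_histogram))
```

In a heterozygous diploid genome the tallest peak may be the heterozygous one, so this naive estimate can be off by a factor of two; handling that case is one reason to use a model-based tool.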

What is a genome assembly?

A genome assembly is the complete genomic sequence reconstructed de novo (i.e. without a reference) from raw sequencing reads and deposited in a public database by the genome's curators upon publication.

  • Until now, built from a single ‘reference’ individual, or from pooled individuals collapsed into a single reference.
  • Haploid.
  • Alternative alleles are sometimes reported as variants.
  • Haplotypes were often completely shuffled, because short-read NGS data cannot resolve long haplotype blocks.
  • Sometimes more than one assembly is present.

The primary assembly

The Primary Assembly is what GenBank curators consider the most up-to-date source of the genomic sequence for this reference.

When available, it is the first result shown.

For example, for any given human gene the Primary Assembly normally refers to the human genome assembly ‘GRCh38’ (Genome Reference Consortium Human Build 38), the latest in the long series of high-quality human genome assemblies released since 2001.

2. Contiguity
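
Contiguity is most often summarised with N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below computes it from a list of contig or scaffold lengths; the function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (total = 300): 80 + 70 already covers half.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 70
```

N50 rewards long sequences regardless of whether they are correctly joined, so it should be read together with the structural accuracy checks.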

2. Quality (QV)
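
QV is a Phred-scaled consensus quality: QV = -10 · log10(error rate), so QV 40 corresponds to one error per 10,000 bases. The sketch below only applies that definition; how the errors are counted (for example from k-mers or from read alignments) depends on the tool and is not shown. The function name and the numbers are illustrative assumptions.

```python
import math


def phred_qv(errors, total_bases):
    """Phred-scaled quality value for a given number of erroneous bases."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if errors < 0 or errors > total_bases:
        raise ValueError("errors must be between 0 and total_bases")
    if errors == 0:
        return math.inf  # no observed errors: QV is unbounded
    return -10 * math.log10(errors / total_bases)


# One error per 10,000 bases gives QV 40; one per 1,000,000 gives QV 60.
print(round(phred_qv(1, 10_000), 2))
print(round(phred_qv(1, 1_000_000), 2))
```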

3. BUSCO scores

4. Structural accuracy

5. K-mer multiplicity

6. Switch error rate

Many tools, one pipeline