Data Repositories

Getting hold of HTS data

  • From public repositories.
  • From collaborators.
  • By sequencing some of your own material!


Repositories for HTS


Public Repositories for HTS

  • Several public sources of HTS data exist.
  • We first concentrate on those that act as repositories:
    • GEO (Gene Expression Omnibus).
    • ENA (European Nucleotide Archive).
    • SRA (Sequence Read Archive, formerly the Short Read Archive).

Gene Expression Omnibus

  • GEO holds different types of biological datasets.
  • Very popular for submission of data accompanying publication.
  • Captures metadata, processed files and raw data.
  • GEO was not built for HTS data.

Sequence Read Archive

  • SRA (www.ncbi.nlm.nih.gov/sra)

  • NCBI’s HTS-specific repository.
  • Sequencing-specific metadata.
  • Stores raw data (in SRA format).
  • Reading the SRA format requires the SRA Toolkit.
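
As a minimal sketch of what "requires the SRA Toolkit" means in practice, the Python below only builds the two command lines typically used to fetch a run and convert it to FASTQ (`prefetch`, then `fasterq-dump`). It does not run them. The function name, the `fastq` output directory and the accession `SRR0000000` are placeholders chosen for illustration, and the toolkit is assumed to be installed and on the PATH.

```python
import shlex


def sra_to_fastq_commands(accession, outdir="fastq"):
    """Build the SRA Toolkit commands to fetch one run and convert it to FASTQ.

    Hypothetical helper: it only constructs the argument lists, it does not
    download anything.
    """
    # Step 1: download the run in SRA format.
    prefetch = ["prefetch", accession]
    # Step 2: convert the SRA file to FASTQ, one file per read of a pair.
    convert = ["fasterq-dump", "--split-files", "--outdir", outdir, accession]
    return [prefetch, convert]


# "SRR0000000" is a placeholder, not a real run accession.
for command in sra_to_fastq_commands("SRR0000000"):
    print(shlex.join(command))
```

The lists can be passed to `subprocess.run(command, check=True)` once a real run accession is substituted.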

European Nucleotide Archive

  • ENA acts as a European HTS repository.
  • Mirrors much of SRA.
  • Stores raw data.
  • No SRA format: FASTQ is provided by default.
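
Because ENA serves FASTQ directly, the usual first step is to look up the FASTQ locations for a run. The sketch below builds a query URL for the ENA Portal API file report. The endpoint and the `read_run` / `fastq_ftp` names reflect our understanding of that API and should be checked against the current ENA documentation. The helper name and the accession are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_fastq_report_url(accession):
    """Return a URL asking ENA for the FASTQ locations of one accession.

    Hypothetical helper: it only builds the URL and makes no network request.
    """
    params = {
        "accession": accession,
        "result": "read_run",                 # one row per sequencing run
        "fields": "run_accession,fastq_ftp",  # columns requested in the report
        "format": "tsv",
    }
    return ENA_FILEREPORT + "?" + urlencode(params)


# "SRR0000000" is a placeholder, not a real run accession.
print(ena_fastq_report_url("SRR0000000"))
```

Fetching the URL (for example with `urllib.request.urlopen`) should return a small tab-separated table whose `fastq_ftp` column lists the files to download.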

Other Repositories

ENCODE Portal

The ENCODE portal provides access to raw data and to processed, standardised results.

Repositories for processed data

  • Other specialist repositories exist.
  • The recount2 database provides standardised counts for user analysis.
  • Other databases such as ImmGen, BodyMap and Expression Atlas provide RNA-seq data for specific cells and tissues.

Reference data

  • Reference genomes are available from many locations.
  • Different assemblies exist.
    • Major revisions: change genomic locations (coordinates).
    • Minor revisions: update annotation.
  • Genome sequence is stored as FASTA.
  • Gene builds are stored as GFF3 or GTF.
  • iGenomes contains full annotation files for many genomes.
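
To make the two reference formats concrete, here is a small sketch that reads a FASTA sequence and one GTF feature line, then extracts the sequence of that feature. The toy sequence, the gene and transcript names, and both helper functions are made up for illustration. Real files are far larger and are normally handled with dedicated libraries.

```python
def read_fasta(lines):
    """Return {sequence name: sequence} from an iterable of FASTA lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def parse_gtf_line(line):
    """Split one GTF feature line into its nine tab-separated columns."""
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")
    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attrs[key] = value.strip('"')
    return {"seqname": seqname, "feature": feature, "start": int(start),
            "end": int(end), "strand": strand, "attributes": attrs}


# Toy data, invented for this example.
toy_fasta = [">chrToy made-up sequence", "ACGTAC", "GGTA"]
toy_gtf = 'chrToy\texample\texon\t2\t7\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";'

genome = read_fasta(toy_fasta)
exon = parse_gtf_line(toy_gtf)
# GTF coordinates are 1-based and inclusive, so shift the start for slicing.
exon_seq = genome[exon["seqname"]][exon["start"] - 1:exon["end"]]
print(exon["attributes"]["gene_id"], exon_seq)
```

The coordinate shift in the last step is the reason assembly revisions matter: annotation coordinates are only meaningful against the assembly they were built on.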