Genomic Features are often stored as GTF or GFF files.
In particular, we often find gene models as GFF or GTF files.
This importantly allows us to associate:
With this mapping we can perform complex operations such as summarising genome wide RNAseq signal to genes/transcripts expressions.
We use GFF and GTF in IGV to display gene models.
GTF is perhaps the most common format for gene models.
In this example we have a GTF file containing a single gene model for the Xkr4 gene.
This can be found in data/Xkr4.gtf
IGV parses information from the GTF file to construct the gene model we see containing untranslated regions and directionality of expression.
In Bioconductor, gene models are prepackaged for us in the TxDb packages.
Format is TxDb. species . source . major version . table
Homo Sapiens gene build from UCSC’s version hg19 known gene table - TxDb.Hsapiens.UCSC.hg19.knownGene
We can however make use of GTF/GFF files of our own gene models outside of Bioconductor pre-built TxDB packages.
The GenomicFeatures package has functions to import a GTF/GFF into a Bioconductor TxDb object as well as to interact with TxDb objects.
The first package we will look at is the TxDB package - TxDb.Hsapiens.UCSC.hg19.knownGene.
Remember we can install Bioconductor packages (fairly) easily using Bioconductor package pages’ provided commands.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("TxDb.Hsapiens.UCSC.hg19.knownGene") BiocManager
Now we have the package installed, we can load the library.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Once we have loaded the TxDb.Hsapiens.UCSC.hg19.knownGene library, we will have a new TxDb object available to us called TxDb.Hsapiens.UCSC.hg19.knownGene.
class(TxDb.Hsapiens.UCSC.hg19.knownGene)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
When this variable name is typed in into console we can retrieve some key information on the Gene Model build as well as summary metrics.
TxDb.Hsapiens.UCSC.hg19.knownGene
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
As suggested in the name, TxDb objects are a database containing information we would store in GTF/GFF files.
To access information in our TxDb object we will need to use special accessor functions.
These accessors return information in GRanges’ formats allowing us to work with methods from the GenomicRanges packages.
We have a number of functions which allow us to extract GRanges for our features of interest including:
We can extract genes with the genes() function and the TxDb object as a parameter.
<- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) myGenes
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
myGenes
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 58858172-58874214 - | 1
## 10 chr8 18248755-18258723 + | 10
## 100 chr20 43248163-43280376 - | 100
## 1000 chr18 25530930-25757445 - | 1000
## 10000 chr1 243651535-244006886 - | 10000
## ... ... ... ... . ...
## 9991 chr9 114979995-115095944 - | 9991
## 9992 chr21 35736323-35743440 + | 9992
## 9993 chr22 19023795-19109967 - | 9993
## 9994 chr6 90539619-90584155 + | 9994
## 9997 chr22 50961997-50964905 - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We can extract exons with the exons() function and the TxDb object as a parameter in the same way.
<- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
myExons myExons
## GRanges object with 289969 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 11874-12227 + | 1
## [2] chr1 12595-12721 + | 2
## [3] chr1 12613-12721 + | 3
## [4] chr1 12646-12697 + | 4
## [5] chr1 13221-14409 + | 5
## ... ... ... ... . ...
## [289965] chrUn_gl000241 35706-35859 - | 289965
## [289966] chrUn_gl000241 36711-36875 - | 289966
## [289967] chrUn_gl000243 11501-11530 + | 289967
## [289968] chrUn_gl000243 13608-13637 + | 289968
## [289969] chrUn_gl000247 5787-5816 - | 289969
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We can extract transcripts with the transcripts() functions and the TxDb object as a parameter.
<- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
myTranscripts myTranscripts
## GRanges object with 82960 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 11874-14409 + | 1 uc001aaa.3
## [2] chr1 11874-14409 + | 2 uc010nxq.1
## [3] chr1 11874-14409 + | 3 uc010nxr.1
## [4] chr1 69091-70008 + | 4 uc001aal.1
## [5] chr1 321084-321115 + | 5 uc001aaq.2
## ... ... ... ... . ... ...
## [82956] chrUn_gl000237 1-2686 - | 82956 uc011mgu.1
## [82957] chrUn_gl000241 20433-36875 - | 82957 uc011mgv.2
## [82958] chrUn_gl000243 11501-11530 + | 82958 uc011mgw.1
## [82959] chrUn_gl000243 13608-13637 + | 82959 uc022brq.1
## [82960] chrUn_gl000247 5787-5816 - | 82960 uc022brr.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We can extract coding regions with the cds() functions and the TxDb object as a parameter.
<- cds(TxDb.Hsapiens.UCSC.hg19.knownGene)
myCDS myCDS
## GRanges object with 237533 ranges and 1 metadata column:
## seqnames ranges strand | cds_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 12190-12227 + | 1
## [2] chr1 12595-12721 + | 2
## [3] chr1 13403-13639 + | 3
## [4] chr1 69091-70008 + | 4
## [5] chr1 324343-324345 + | 5
## ... ... ... ... . ...
## [237529] chrUn_gl000228 30530-31035 - | 237529
## [237530] chrUn_gl000228 31353-31430 - | 237530
## [237531] chrUn_gl000228 31660-31734 - | 237531
## [237532] chrUn_gl000228 31660-31737 - | 237532
## [237533] chrUn_gl000228 31996-32173 - | 237533
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We can control which additional information such as transcript ID and gene ID we retrieve in metadata columns using the columns parameter in the exons()/transcripts()/cds() functions.
For a full list of available columns see ?transcripts() function help.
<- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene,
myTranscripts columns=c("gene_id","tx_id"))
1:2] myTranscripts[
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | gene_id tx_id
## <Rle> <IRanges> <Rle> | <CharacterList> <integer>
## [1] chr1 11874-14409 + | 100287102 1
## [2] chr1 11874-14409 + | 100287102 2
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Another useful function is the promoters() function.
This accepts additional parameters of upstream and downstream to specify regions around TSS to be used as a promoter while accounting for a transcripts’s strand.
We have been using the resize() function with GRanges objects for the same purpose.
<- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene,
myPromoters upstream=2000,downstream=50)
1:2] myPromoters[
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## uc001aaa.3 chr1 9874-11923 + | 1 uc001aaa.3
## uc010nxq.1 chr1 9874-11923 + | 2 uc010nxq.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
A major component of gene models in GFF and GTF files is the grouping of exons to transcripts/genes and transcripts to genes.
We can extract lists of GRanges (a GRangesList) containing the grouping of features to metafeatures using additional accessor functions.
These include the transcriptsBy(), exonsBy() and cdsBy().
The transcriptsBy() function takes as argument a TxDb object and a by argument specifying the metafeature/feature to group by.
For the transcriptsBy(), the by argument may be gene, exon or cds.
<- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
transcriptByGenes by="gene")
1:2] transcriptByGenes[
## GRangesList object of length 2:
## $`1`
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr19 58858172-58864865 - | 70455 uc002qsd.4
## [2] chr19 58859832-58874214 - | 70456 uc002qsf.2
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
##
## $`10`
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr8 18248755-18258723 + | 31944 uc003wyw.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
As with all lists we can extract information from elements in the GRangesList using the $ and [[]] accessors.
This will extract the grouped intervals as a GRanges.
# transcriptByGenes$1 or
1]] transcriptByGenes[[
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr19 58858172-58864865 - | 70455 uc002qsd.4
## [2] chr19 58859832-58874214 - | 70456 uc002qsf.2
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Similarly the very useful exonsBy() function takes again an argument for a TxDb object and a by argument.
For the exonsBy() and cdsBy() functions, the by argument may be gene, or transcript.
<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
exonsByTranscript by="tx")
1:2] exonsByTranscript[
## GRangesList object of length 2:
## $`1`
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12613-12721 + | 3 <NA> 2
## [3] chr1 13221-14409 + | 5 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
##
## $`2`
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 11874-12227 + | 1 <NA> 1
## [2] chr1 12595-12721 + | 2 <NA> 2
## [3] chr1 13403-14409 + | 6 <NA> 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Often we will have our favourite gene models stored in a GTF outside of Bioconductor objects. If we wish to take advantage of all the features in Bioconductor we will need to import it into a TxDb object.
We can do this using the makeTxDbFromGFF() function which will parse GTF or GFF into a TxDb object
<- makeTxDbFromGFF("data/Xkr4.gtf") myCustomTxDb
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
class(myCustomTxDb)
## [1] "TxDb"
## attr(,"package")
## [1] "GenomicFeatures"
myCustomTxDb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: data/Xkr4.gtf
## # Organism: NA
## # Taxonomy ID: NA
## # miRBase build ID: NA
## # Genome: NA
## # Nb of transcripts: 1
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-07-16 00:26:29 +0000 (Fri, 16 Jul 2021)
## # GenomicFeatures version at creation time: 1.44.0
## # RSQLite version at creation time: 2.2.7
## # DBSCHEMAVERSION: 1.2
Now we have our own custom GTF object we can make use of the functions from GenomicFeatures package such as retrieving gene coordinates as a GRanges.
genes(myCustomTxDb)
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## Xkr4 chr1 3214482-3671498 - | Xkr4
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Or exons grouped by genes as a GRangesList.
exonsBy(myCustomTxDb,by="gene")
## GRangesList object of length 1:
## $Xkr4
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 3214482-3216968 - | 1 <NA>
## [2] chr1 3421702-3421901 - | 2 <NA>
## [3] chr1 3670552-3671498 - | 3 <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can create a TxDb directly from UCSC’s database using the makeTxDbFromUCSC() function.
The makeTxDbFromUCSC function simply requires a genome name to build a TxDb object.
We can review the genomes available to us using the ucscGenomes() function from the rtracklayer package. This returns a data frame of genomes from which we can build a TxDB object.
library(rtracklayer)
<- ucscGenomes()
availableGenomes 1:4,] availableGenomes[
## db species date
## 1 ailMel1 Panda Dec. 2009
## 2 allMis1 American alligator Aug. 2012
## 3 anoCar1 Lizard Feb. 2007
## 4 anoCar2 Lizard May 2010
## name
## 1 BGI-Shenzhen AilMel 1.0 Dec. 2009
## 2 International Crocodilian Genomes Working Group
## 3 Broad Institute AnoCar (1.0)
## 4 Broad Institute of MIT and Harvard AnoCar 2.0 (GCA_000090745.1)
Once we have identified the genome build we wish to use we can specify the genome parameter in the makeTxDbFromUCSC() function.
<- makeTxDbFromUCSC(genome="hg18")
hg18TxDb hg18TxDb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg18
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # UCSC Track: UCSC Genes
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 66803
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-03-04 18:35:05 -0500 (Thu, 04 Mar 2021)
## # GenomicFeatures version at creation time: 1.40.1
## # RSQLite version at creation time: 2.2.3
## # DBSCHEMAVERSION: 1.2
And again we can now make use of GenomicFeatures package’s functions with our new TxDb objects
<- promoters(hg18TxDb,2000,50) hg18Promoters
We can make use of the export.gff() in the rtracklayer package to export our TxDb object to a gtf or gff file. We can include an additional argument of format to the export.gff() function.
We can specify the format to be gff or gtf as shown below.
export.gff(myCustomTxDb,con="customTxDbb.gff",format="gff")
export.gff(myCustomTxDb,con="customTxDbb.gtf",format="gtf")
We have seen that we can extract transcripts and exons grouped by their genes/transcripts using the transcriptsBy() and exonsBy() functions.
To identify the total number of groups we can simply use the length() function. Here we have 23459 transcipts or groups of exons.
<- exonsBy(hg18TxDb,by="gene")
transcriptByGenes length(transcriptByGenes)
## [1] 20121
To find the number of exons in every group/transcript we can use the function lengths(). The lengths() function will tell us the length of every group in our GRangesList
<- lengths(transcriptByGenes)
transcriptNumberPerGene 1:5] transcriptNumberPerGene[
## 1 10 100 1000 10000
## 15 3 12 16 16
We may wish to know the total sum length of all exons in every transcript. The transcriptLengths() function allows us to find all transcript lengths from a TxDb object as well as additional information.
<- transcriptLengths(hg18TxDb)
transcript_Lens 1:5,] transcript_Lens[
## tx_id tx_name gene_id nexon tx_len
## 1 1 uc001aaa.2 <NA> 3 2122
## 2 2 uc009vip.1 <NA> 2 2772
## 3 3 uc009vjg.1 <NA> 3 708
## 4 4 uc009vjh.1 79501 3 888
## 5 5 uc001aal.1 79501 1 918
As with many Bioconductor packages, there is interoperability between BSgenome packages holding genome sequences and TxDb objects with gene models. One very useful function is the extractTranscriptSeqs() function.
The extractTranscriptSeqs() accepts a BSgenome object of genomic sequence of interest and a TxDb object.
library(BSgenome.Hsapiens.UCSC.hg19)
<- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19,
hg19TransSeq transcripts=TxDb.Hsapiens.UCSC.hg19.knownGene)
hg19TransSeq
## DNAStringSet object of length 82960:
## width seq names
## [1] 1652 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 1
## [2] 1488 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 2
## [3] 1595 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 3
## [4] 918 ATGGTGACTGAATTCATTTTT...ATTCTAGTGTAAAGTTTTAG 4
## [5] 32 TACAGACCAAGCTCATGACTCACAATGGCCTA 5
## ... ... ...
## [82956] 1217 GCCAGTTTAGGGTCTCTGGTA...AACCTCCGCCTCCTGAGATC 82956
## [82957] 737 CTCCACTTCTGATCCTCCCCG...CCGCTTTATTAGATGCAGTG 82957
## [82958] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA 82958
## [82959] 30 TTGCAGAGGTGGCTGGTTGCTCTTTGAGCC 82959
## [82960] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA 82960
We can then either work with the resulting DNAStringSet then in R or write to Fasta file for analysis in external programs.
writeXStringSet(hg19TransSeq,"myTranscriptSequences.fa")
So far we have used the standard UCSC style of chromosome names
Sadly, conventions for chromosome and contig names can vary across different annotation sources.
Ensembl for instance uses.
The GenomeInfoDb package and associated GenomeInfoDbData package contain mappings between differing naming conventions and functions to convert between them.
First we need to load the GenomeInfoDb library.
library(GenomeInfoDb)
We can review all the mappings for supported organisms’ genomes using the genomeStyles() function with no arguments.
The genomeStyles() function returns a list of mapping data.frames of each organism.
<- genomeStyles()
allMappings names(allMappings)
## [1] "Arabidopsis_thaliana" "Caenorhabditis_elegans"
## [3] "Canis_familiaris" "Cyanidioschyzon_merolae"
## [5] "Drosophila_melanogaster" "Homo_sapiens"
## [7] "Mus_musculus" "Oryza_sativa"
## [9] "Populus_trichocarpa" "Rattus_norvegicus"
## [11] "Saccharomyces_cerevisiae" "Zea_mays"
We can then review mappings for our organism of interest’s genomes by retrieving the relevant entry from our list returned by the genomeStyles() function.
The resulting data.frame contains information on contig/chromosome mappings as well as additional information on chromosomes, i.e. Circularity.
#allMappings$Homo_sapiens or
"Homo_sapiens"]] allMappings[[
## circular auto sex NCBI UCSC dbSNP Ensembl
## 1 FALSE TRUE FALSE 1 chr1 ch1 1
## 2 FALSE TRUE FALSE 2 chr2 ch2 2
## 3 FALSE TRUE FALSE 3 chr3 ch3 3
## 4 FALSE TRUE FALSE 4 chr4 ch4 4
## 5 FALSE TRUE FALSE 5 chr5 ch5 5
## 6 FALSE TRUE FALSE 6 chr6 ch6 6
## 7 FALSE TRUE FALSE 7 chr7 ch7 7
## 8 FALSE TRUE FALSE 8 chr8 ch8 8
## 9 FALSE TRUE FALSE 9 chr9 ch9 9
## 10 FALSE TRUE FALSE 10 chr10 ch10 10
## 11 FALSE TRUE FALSE 11 chr11 ch11 11
## 12 FALSE TRUE FALSE 12 chr12 ch12 12
## 13 FALSE TRUE FALSE 13 chr13 ch13 13
## 14 FALSE TRUE FALSE 14 chr14 ch14 14
## 15 FALSE TRUE FALSE 15 chr15 ch15 15
## 16 FALSE TRUE FALSE 16 chr16 ch16 16
## 17 FALSE TRUE FALSE 17 chr17 ch17 17
## 18 FALSE TRUE FALSE 18 chr18 ch18 18
## 19 FALSE TRUE FALSE 19 chr19 ch19 19
## 20 FALSE TRUE FALSE 20 chr20 ch20 20
## 21 FALSE TRUE FALSE 21 chr21 ch21 21
## 22 FALSE TRUE FALSE 22 chr22 ch22 22
## 23 FALSE FALSE TRUE X chrX chX X
## 24 FALSE FALSE TRUE Y chrY chY Y
## 25 TRUE FALSE FALSE MT chrM chMT MT
The seqlevelsStyle() function allows us to review as well as set the naming convention for our GenomicRanges or TxDb objects.
First lets review the genes from the TxDb.Hsapiens.UCSC.hg19.knownGene package. Here we see they are UCSC style.
<- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) myGenes
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
seqlevelsStyle(myGenes)
## [1] "UCSC"
1:2,] myGenes[
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 58858172-58874214 - | 1
## 10 chr8 18248755-18258723 + | 10
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Now we can update them to Ensembl, NCBI or dbSNP styles. Here we update to Ensembl chromosome naming conventions.
This conversion is essential if we want to use our objects with data generated following Ensembl styles.
seqlevelsStyle(myGenes) <- "Ensembl"
1:2,] myGenes[
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 19 58858172-58874214 - | 1
## 10 8 18248755-18258723 + | 10
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Where as the TxDb packages hold information on gene models such as exons positions and exon to gene mapping, the OrgDb packages contain information on gene’s mapping to alternative IDs and any functional annotation.
Information on model organism’s gene annotation is contained with the org.db packages.
Format is org. species . ID type .db
Homo Sapiens annotation with Entrez Gene IDs – org.Hs.eg.db
To gain access to gene annotation we must first load the library for the relevant species.
Here we install and load the library for Human, Org.Hs.eg.db.
::install("org.Hs.eg.db")
BiocManagerlibrary(org.Hs.eg.db)
class(org.Hs.eg.db)
## [1] "OrgDb"
## attr(,"package")
## [1] "AnnotationDbi"
As with the TxDb packages, we will need to use special accessors to retrieve information from Org.db.
We can use the functions:
We will explore how to retrieve annotation from gene-centric organism level annotation package (org.Hs.eg.db) using these functions.
We can use the columns() function first to list available annotation.
Here we see our annotation for alternative IDs (ENSEMBL, REFSEQ, GENENAME) as well as functional annotation (GO, PATH, OMIM)
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
– To know more about the above identifier types
help(GENENAME)
Which keytypes can be used to query this database can be retrieved using keytypes.
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
If we want to extract few identifiers of a particular keytype, we can use keys() function
keys(org.Hs.eg.db, keytype="SYMBOL")[1:10]
## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
## [7] "SERPINA3" "AADAC" "AAMP" "AANAT"
We can extract other annotations for a particular identifier using select function
select(org.Hs.eg.db, keys = "A1BG", keytype = "SYMBOL",
columns = c("SYMBOL", "GENENAME", "CHR") )
## SYMBOL GENENAME CHR
## 1 A1BG alpha-1-B glycoprotein 19
We can use this now to translate our Entrez IDs from TxDb object to a gene name
<- genes(hg18TxDb) geneLocations
## 379 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
geneLocations
## GRanges object with 19742 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 63549984-63565932 - | 1
## 10 chr8 18293035-18303003 + | 10
## 100 chr20 42681577-42713790 - | 100
## 1000 chr18 23784933-24011189 - | 1000
## 10000 chr1 241718158-242073207 - | 10000
## ... ... ... ... . ...
## 9991 chr9 114020538-114135733 - | 9991
## 9992 chr21 34658193-34665310 + | 9992
## 9993 chr22 17403795-17489967 - | 9993
## 9994 chr6 90596334-90640876 + | 9994
## 9997 chr22 49308863-49310900 - | 9997
## -------
## seqinfo: 49 sequences (1 circular) from hg18 genome
<- geneLocations$gene_id
IDs <- select(org.Hs.eg.db, keys = IDs, keytype = "ENTREZID",
myTable columns = c("SYMBOL", "GENENAME", "ENTREZID") )
1:2,] myTable[
## ENTREZID SYMBOL GENENAME
## 1 1 A1BG alpha-1-B glycoprotein
## 2 10 NAT2 N-acetyltransferase 2