class: center, middle, inverse, title-slide # Genomic Features In Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- class: inverse, center, middle # Genomic Features <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Genomic Features Genomic Features are often stored as GTF or GFF files. <div align="center"> <img src="imgs/gff.png" alt="igv" height="200" width="600"> </div> --- ## Genes as Genomic Features In particular, we often find gene models as GFF or GTF files. This importantly allows us to associate: * Exons with transcripts. * Transcripts with genes. With this mapping we can perform complex operations such as summarising genome wide RNAseq signal to genes/transcripts expressions. --- ## IGV and GFF/GTF We use GFF and GTF in IGV to display gene models. <div align="center"> <img src="imgs/fullGenes.png" height="460" width="800"> </div> --- ## A Single Gene Model in GTF Format GTF is perhaps the most common format for gene models. In this example we have a GTF file containing a single gene model for the **Xkr4** gene. This can be found in **data/Xkr4.gtf** <div align="center"> <img src="imgs/gtfInTerminal.png" height="138" width="800"> </div> --- ## A Single Gene Model in IGV IGV parses information from the GTF file to construct the gene model we see containing untranslated regions and directionality of expression. <div align="center"> <img src="imgs/gtfInIGV.png" height="500" width="800"> </div> --- ## Genomic Features in Bioconductor In Bioconductor, gene models are prepackaged for us in the **TxDb** packages. Format is TxDb. **species** . **source** . **major version** . **table** Homo Sapiens gene build from UCSC's version hg19 known gene table - TxDb.Hsapiens.UCSC.hg19.knownGene <div align="center"> <img src="imgs/txdb.png" alt="igv" height="400" width="400"> </div> --- ## Genomic Features in Bioconductor We can however make use of GTF/GFF files of our own gene models outside of Bioconductor pre-built TxDB packages. The **GenomicFeatures** package has functions to import a GTF/GFF into a Bioconductor **TxDb** object as well as to interact with **TxDb** objects. - [**GenomicFeatures**](https://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html) - Import to and handling of **TxDB** objects in R. --- ## Genomic Features in Bioconductor The first package we will look at is the **TxDB** package - **TxDb.Hsapiens.UCSC.hg19.knownGene**. Remember we can install Bioconductor packages (fairly) easily using Bioconductor package pages' provided commands. ```r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene") ``` --- ## Genomic Intervals in Bioconductor Now we have the package installed, we can load the library. ```r library(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` --- ## TxDb Once we have loaded the **TxDb.Hsapiens.UCSC.hg19.knownGene** library, we will have a new **TxDb** object available to us called **TxDb.Hsapiens.UCSC.hg19.knownGene**. ```r class(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` ``` ## [1] "TxDb" ## attr(,"package") ## [1] "GenomicFeatures" ``` --- ## TxDb When this variable name is typed in into console we can retrieve some key information on the Gene Model build as well as summary metrics. ```r TxDb.Hsapiens.UCSC.hg19.knownGene ``` ``` ## TxDb object: ## # Db type: TxDb ## # Supporting package: GenomicFeatures ## # Data source: UCSC ## # Genome: hg19 ## # Organism: Homo sapiens ## # Taxonomy ID: 9606 ## # UCSC Table: knownGene ## # Resource URL: http://genome.ucsc.edu/ ## # Type of Gene ID: Entrez Gene ID ## # Full dataset: yes ## # miRBase build ID: GRCh37 ## # transcript_nrow: 82960 ## # exon_nrow: 289969 ## # cds_nrow: 237533 ## # Db created by: GenomicFeatures package from Bioconductor ## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015) ## # GenomicFeatures version at creation time: 1.21.30 ## # RSQLite version at creation time: 1.0.0 ## # DBSCHEMAVERSION: 1.1 ``` --- ## Accessing Information As suggested in the name, **TxDb** objects are a database containing information we would store in GTF/GFF files. To access information in our **TxDb** object we will need to use special accessor functions. These accessors return information in **GRanges'** formats allowing us to work with methods from the **GenomicRanges** packages. --- # Extracting Genomic Locations We have a number of functions which allow us to extract GRanges for our features of interest including: * **genes()** - Extract gene locations as a **GRanges**. * **transcripts()** - Extract transcript locations as a **GRanges**. * **exons()** - Extract exon locations as a **GRanges**. * **cds()** - Extract coding locations as a **GRanges**. * **promoters()** - Extract promoter locations as a **GRanges**. --- ## Extracting Genes We can extract genes with the **genes()** function and the **TxDb** object as a parameter. ```r myGenes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` ``` ## 403 genes were dropped because they have exons located on both strands ## of the same reference sequence or on more than one reference sequence, ## so cannot be represented by a single genomic range. ## Use 'single.strand.genes.only=FALSE' to get all the genes in a ## GRangesList object, or use suppressMessages() to suppress this message. ``` ```r myGenes ``` ``` ## GRanges object with 23056 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 1 chr19 58858172-58874214 - | 1 ## 10 chr8 18248755-18258723 + | 10 ## 100 chr20 43248163-43280376 - | 100 ## 1000 chr18 25530930-25757445 - | 1000 ## 10000 chr1 243651535-244006886 - | 10000 ## ... ... ... ... . ... ## 9991 chr9 114979995-115095944 - | 9991 ## 9992 chr21 35736323-35743440 + | 9992 ## 9993 chr22 19023795-19109967 - | 9993 ## 9994 chr6 90539619-90584155 + | 9994 ## 9997 chr22 50961997-50964905 - | 9997 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Exons We can extract exons with the **exons()** function and the **TxDb** object as a parameter in the same way. ```r myExons <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene) myExons ``` ``` ## GRanges object with 289969 ranges and 1 metadata column: ## seqnames ranges strand | exon_id ## <Rle> <IRanges> <Rle> | <integer> ## [1] chr1 11874-12227 + | 1 ## [2] chr1 12595-12721 + | 2 ## [3] chr1 12613-12721 + | 3 ## [4] chr1 12646-12697 + | 4 ## [5] chr1 13221-14409 + | 5 ## ... ... ... ... . ... ## [289965] chrUn_gl000241 35706-35859 - | 289965 ## [289966] chrUn_gl000241 36711-36875 - | 289966 ## [289967] chrUn_gl000243 11501-11530 + | 289967 ## [289968] chrUn_gl000243 13608-13637 + | 289968 ## [289969] chrUn_gl000247 5787-5816 - | 289969 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Transcripts We can extract transcripts with the **transcripts()** functions and the **TxDb** object as a parameter. ```r myTranscripts <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene) myTranscripts ``` ``` ## GRanges object with 82960 ranges and 2 metadata columns: ## seqnames ranges strand | tx_id tx_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr1 11874-14409 + | 1 uc001aaa.3 ## [2] chr1 11874-14409 + | 2 uc010nxq.1 ## [3] chr1 11874-14409 + | 3 uc010nxr.1 ## [4] chr1 69091-70008 + | 4 uc001aal.1 ## [5] chr1 321084-321115 + | 5 uc001aaq.2 ## ... ... ... ... . ... ... ## [82956] chrUn_gl000237 1-2686 - | 82956 uc011mgu.1 ## [82957] chrUn_gl000241 20433-36875 - | 82957 uc011mgv.2 ## [82958] chrUn_gl000243 11501-11530 + | 82958 uc011mgw.1 ## [82959] chrUn_gl000243 13608-13637 + | 82959 uc022brq.1 ## [82960] chrUn_gl000247 5787-5816 - | 82960 uc022brr.1 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Coding Regions We can extract coding regions with the **cds()** functions and the **TxDb** object as a parameter. ```r myCDS <- cds(TxDb.Hsapiens.UCSC.hg19.knownGene) myCDS ``` ``` ## GRanges object with 237533 ranges and 1 metadata column: ## seqnames ranges strand | cds_id ## <Rle> <IRanges> <Rle> | <integer> ## [1] chr1 12190-12227 + | 1 ## [2] chr1 12595-12721 + | 2 ## [3] chr1 13403-13639 + | 3 ## [4] chr1 69091-70008 + | 4 ## [5] chr1 324343-324345 + | 5 ## ... ... ... ... . ... ## [237529] chrUn_gl000228 30530-31035 - | 237529 ## [237530] chrUn_gl000228 31353-31430 - | 237530 ## [237531] chrUn_gl000228 31660-31734 - | 237531 ## [237532] chrUn_gl000228 31660-31737 - | 237532 ## [237533] chrUn_gl000228 31996-32173 - | 237533 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Additional Information We can control which additional information such as transcript ID and gene ID we retrieve in metadata columns using the **columns** parameter in the **exons()/transcripts()/cds()** functions. For a full list of available columns see **?transcripts()** function help. ```r myTranscripts <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c("gene_id","tx_id")) myTranscripts[1:2] ``` ``` ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | gene_id tx_id ## <Rle> <IRanges> <Rle> | <CharacterList> <integer> ## [1] chr1 11874-14409 + | 100287102 1 ## [2] chr1 11874-14409 + | 100287102 2 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Promoter Regions Another useful function is the **promoters()** function. This accepts additional parameters of **upstream** and **downstream** to specify regions around TSS to be used as a promoter while accounting for a transcripts's strand. We have been using the **resize()** function with **GRanges** objects for the same purpose. ```r myPromoters <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene, upstream=2000,downstream=50) myPromoters[1:2] ``` ``` ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | tx_id tx_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## uc001aaa.3 chr1 9874-11923 + | 1 uc001aaa.3 ## uc010nxq.1 chr1 9874-11923 + | 2 uc010nxq.1 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Groups of Genomic Features A major component of gene models in GFF and GTF files is the grouping of exons to transcripts/genes and transcripts to genes. We can extract lists of GRanges (a GRangesList) containing the grouping of features to metafeatures using additional accessor functions. These include the **transcriptsBy()**, **exonsBy()** and **cdsBy()**. --- ## Extracting Groups of Genomic Features The **transcriptsBy()** function takes as argument a **TxDb** object and a **by** argument specifying the metafeature/feature to group by. For the **transcriptsBy()**, the **by** argument may be gene, exon or cds. ```r transcriptByGenes <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene") transcriptByGenes[1:2] ``` ``` ## GRangesList object of length 2: ## $`1` ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | tx_id tx_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr19 58858172-58864865 - | 70455 uc002qsd.4 ## [2] chr19 58859832-58874214 - | 70456 uc002qsf.2 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ## ## $`10` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | tx_id tx_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr8 18248755-18258723 + | 31944 uc003wyw.1 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Groups of Genomic Features As with all lists we can extract information from elements in the GRangesList using the **$** and **[[]]** accessors. This will extract the grouped intervals as a GRanges. ```r # transcriptByGenes$1 or transcriptByGenes[[1]] ``` ``` ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | tx_id tx_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr19 58858172-58864865 - | 70455 uc002qsd.4 ## [2] chr19 58859832-58874214 - | 70456 uc002qsf.2 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Extracting Groups of Genomic Features Similarly the very useful **exonsBy()** function takes again an argument for a **TxDb** object and a **by** argument. For the **exonsBy()** and **cdsBy()** functions, the **by** argument may be gene, or transcript. ```r exonsByTranscript <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="tx") exonsByTranscript[1:2] ``` ``` ## GRangesList object of length 2: ## $`1` ## GRanges object with 3 ranges and 3 metadata columns: ## seqnames ranges strand | exon_id exon_name exon_rank ## <Rle> <IRanges> <Rle> | <integer> <character> <integer> ## [1] chr1 11874-12227 + | 1 <NA> 1 ## [2] chr1 12613-12721 + | 3 <NA> 2 ## [3] chr1 13221-14409 + | 5 <NA> 3 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ## ## $`2` ## GRanges object with 3 ranges and 3 metadata columns: ## seqnames ranges strand | exon_id exon_name exon_rank ## <Rle> <IRanges> <Rle> | <integer> <character> <integer> ## [1] chr1 11874-12227 + | 1 <NA> 1 ## [2] chr1 12595-12721 + | 2 <NA> 2 ## [3] chr1 13403-14409 + | 6 <NA> 3 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Creating TxDb from External Sources Often we will have our favourite gene models stored in a GTF outside of Bioconductor objects. If we wish to take advantage of all the features in Bioconductor we will need to import it into a **TxDb** object. We can do this using the **makeTxDbFromGFF()** function which will parse GTF or GFF into a **TxDb** object ```r myCustomTxDb <- makeTxDbFromGFF("data/Xkr4.gtf") ``` ``` ## Import genomic features from the file as a GRanges object ... OK ## Prepare the 'metadata' data frame ... OK ## Make the TxDb object ... OK ``` ```r class(myCustomTxDb) ``` ``` ## [1] "TxDb" ## attr(,"package") ## [1] "GenomicFeatures" ``` --- ## Creating TxDb from External Sources ```r myCustomTxDb ``` ``` ## TxDb object: ## # Db type: TxDb ## # Supporting package: GenomicFeatures ## # Data source: data/Xkr4.gtf ## # Organism: NA ## # Taxonomy ID: NA ## # miRBase build ID: NA ## # Genome: NA ## # Nb of transcripts: 1 ## # Db created by: GenomicFeatures package from Bioconductor ## # Creation time: 2021-07-16 00:25:43 +0000 (Fri, 16 Jul 2021) ## # GenomicFeatures version at creation time: 1.44.0 ## # RSQLite version at creation time: 2.2.7 ## # DBSCHEMAVERSION: 1.2 ``` --- ## Creating TxDb from External Sources Now we have our own custom GTF object we can make use of the functions from **GenomicFeatures** package such as retrieving gene coordinates as a **GRanges**. ```r genes(myCustomTxDb) ``` ``` ## GRanges object with 1 range and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## Xkr4 chr1 3214482-3671498 - | Xkr4 ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- ## Creating TxDb from External Sources Or exons grouped by genes as a **GRangesList**. ```r exonsBy(myCustomTxDb,by="gene") ``` ``` ## GRangesList object of length 1: ## $Xkr4 ## GRanges object with 3 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr1 3214482-3216968 - | 1 <NA> ## [2] chr1 3421702-3421901 - | 2 <NA> ## [3] chr1 3670552-3671498 - | 3 <NA> ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- ## Creating TxDb from UCSC We can create a TxDb directly from UCSC's database using the **makeTxDbFromUCSC()** function. The **makeTxDbFromUCSC** function simply requires a genome name to build a **TxDb** object. We can review the genomes available to us using the **ucscGenomes()** function from the **rtracklayer** package. This returns a data frame of genomes from which we can build a TxDB object. ```r library(rtracklayer) availableGenomes <- ucscGenomes() availableGenomes[1:4,] ``` ``` ## db species date ## 1 ailMel1 Panda Dec. 2009 ## 2 allMis1 American alligator Aug. 2012 ## 3 anoCar1 Lizard Feb. 2007 ## 4 anoCar2 Lizard May 2010 ## name ## 1 BGI-Shenzhen AilMel 1.0 Dec. 2009 ## 2 International Crocodilian Genomes Working Group ## 3 Broad Institute AnoCar (1.0) ## 4 Broad Institute of MIT and Harvard AnoCar 2.0 (GCA_000090745.1) ``` --- ## Creating TxDb from UCSC Once we have identified the genome build we wish to use we can specify the genome parameter in the **makeTxDbFromUCSC()** function. <!-- # issue here atm. UCSC_dbselect command sets us rmariadb connection. maybe have to override this and change ports used?. For the moment i have just added a little work aorund by saving the object. --> ```r hg18TxDb <- makeTxDbFromUCSC(genome="hg18") hg18TxDb ``` ``` ## TxDb object: ## # Db type: TxDb ## # Supporting package: GenomicFeatures ## # Data source: UCSC ## # Genome: hg18 ## # Organism: Homo sapiens ## # Taxonomy ID: 9606 ## # UCSC Table: knownGene ## # UCSC Track: UCSC Genes ## # Resource URL: http://genome.ucsc.edu/ ## # Type of Gene ID: Entrez Gene ID ## # Full dataset: yes ## # miRBase build ID: NA ## # Nb of transcripts: 66803 ## # Db created by: GenomicFeatures package from Bioconductor ## # Creation time: 2021-03-04 18:35:05 -0500 (Thu, 04 Mar 2021) ## # GenomicFeatures version at creation time: 1.40.1 ## # RSQLite version at creation time: 2.2.3 ## # DBSCHEMAVERSION: 1.2 ``` --- ## Creating TxDb from UCSC And again we can now make use of **GenomicFeatures** package's functions with our new TxDb objects ```r hg18Promoters <- promoters(hg18TxDb,2000,50) ``` --- ## Exporting a TxDb Object as GTF/GFF File We can make use of the **export.gff()** in the **rtracklayer** package to export our **TxDb** object to a gtf or gff file. We can include an additional argument of **format** to the **export.gff()** function. We can specify the **format** to be **gff** or **gtf** as shown below. ```r export.gff(myCustomTxDb,con="customTxDbb.gff",format="gff") export.gff(myCustomTxDb,con="customTxDbb.gtf",format="gtf") ``` --- ## Useful Functions in GenomicFeatures We have seen that we can extract transcripts and exons grouped by their genes/transcripts using the **transcriptsBy()** and **exonsBy()** functions. To identify the total number of groups we can simply use the **length()** function. Here we have 23459 transcipts or groups of exons. ```r transcriptByGenes <- exonsBy(hg18TxDb,by="gene") length(transcriptByGenes) ``` ``` ## [1] 20121 ``` --- ## Useful Functions in GenomicFeatures To find the number of exons in every group/transcript we can use the function **lengths()**. The **lengths()** function will tell us the length of every group in our **GRangesList** ```r transcriptNumberPerGene <- lengths(transcriptByGenes) transcriptNumberPerGene[1:5] ``` ``` ## 1 10 100 1000 10000 ## 15 3 12 16 16 ``` --- ## Useful Functions in GenomicFeatures We may wish to know the total sum length of all exons in every transcript. The **transcriptLengths()** function allows us to find all transcript lengths from a TxDb object as well as additional information. ```r transcript_Lens <- transcriptLengths(hg18TxDb) transcript_Lens[1:5,] ``` ``` ## tx_id tx_name gene_id nexon tx_len ## 1 1 uc001aaa.2 <NA> 3 2122 ## 2 2 uc009vip.1 <NA> 2 2772 ## 3 3 uc009vjg.1 <NA> 3 708 ## 4 4 uc009vjh.1 79501 3 888 ## 5 5 uc001aal.1 79501 1 918 ``` --- ## Extracting Sequence Information As with many Bioconductor packages, there is interoperability between **BSgenome** packages holding genome sequences and **TxDb** objects with gene models. One very useful function is the **extractTranscriptSeqs()** function. The **extractTranscriptSeqs()** accepts a BSgenome object of genomic sequence of interest and a TxDb object. ```r library(BSgenome.Hsapiens.UCSC.hg19) hg19TransSeq <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, transcripts=TxDb.Hsapiens.UCSC.hg19.knownGene) hg19TransSeq ``` ``` ## DNAStringSet object of length 82960: ## width seq names ## [1] 1652 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 1 ## [2] 1488 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 2 ## [3] 1595 CTTGCCGTCAGCCTTTTCTTT...AAGCACACTGTTGGTTTCTG 3 ## [4] 918 ATGGTGACTGAATTCATTTTT...ATTCTAGTGTAAAGTTTTAG 4 ## [5] 32 TACAGACCAAGCTCATGACTCACAATGGCCTA 5 ## ... ... ... ## [82956] 1217 GCCAGTTTAGGGTCTCTGGTA...AACCTCCGCCTCCTGAGATC 82956 ## [82957] 737 CTCCACTTCTGATCCTCCCCG...CCGCTTTATTAGATGCAGTG 82957 ## [82958] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA 82958 ## [82959] 30 TTGCAGAGGTGGCTGGTTGCTCTTTGAGCC 82959 ## [82960] 30 TGGTGAATTTCAGCCAAAGTGGCCAAAGAA 82960 ``` --- ## Extracting Sequence Information We can then either work with the resulting DNAStringSet then in R or write to Fasta file for analysis in external programs. ```r writeXStringSet(hg19TransSeq,"myTranscriptSequences.fa") ``` --- ## Changing Naming Conventions So far we have used the standard UCSC style of chromosome names - Chr1, Chr2, Chr3 .. Sadly, conventions for chromosome and contig names can vary across different annotation sources. Ensembl for instance uses. - 1, 2, 3, 4 ... --- ## Changing Chromosome Naming Styles The **GenomeInfoDb** package and associated **GenomeInfoDbData** package contain mappings between differing naming conventions and functions to convert between them. First we need to load the **GenomeInfoDb** library. ```r library(GenomeInfoDb) ``` --- ## Reviewing Chromosome Naming Styles We can review all the mappings for supported organisms' genomes using the **genomeStyles()** function with no arguments. The **genomeStyles()** function returns a list of mapping data.frames of each organism. ```r allMappings <- genomeStyles() names(allMappings) ``` ``` ## [1] "Arabidopsis_thaliana" "Caenorhabditis_elegans" ## [3] "Canis_familiaris" "Cyanidioschyzon_merolae" ## [5] "Drosophila_melanogaster" "Homo_sapiens" ## [7] "Mus_musculus" "Oryza_sativa" ## [9] "Populus_trichocarpa" "Rattus_norvegicus" ## [11] "Saccharomyces_cerevisiae" "Zea_mays" ``` --- ## Chromosome Naming Style We can then review mappings for our organism of interest's genomes by retrieving the relevant entry from our list returned by the **genomeStyles()** function. The resulting data.frame contains information on contig/chromosome mappings as well as additional information on chromosomes, i.e. Circularity. ```r #allMappings$Homo_sapiens or allMappings[["Homo_sapiens"]] ``` ``` ## circular auto sex NCBI UCSC dbSNP Ensembl ## 1 FALSE TRUE FALSE 1 chr1 ch1 1 ## 2 FALSE TRUE FALSE 2 chr2 ch2 2 ## 3 FALSE TRUE FALSE 3 chr3 ch3 3 ## 4 FALSE TRUE FALSE 4 chr4 ch4 4 ## 5 FALSE TRUE FALSE 5 chr5 ch5 5 ## 6 FALSE TRUE FALSE 6 chr6 ch6 6 ## 7 FALSE TRUE FALSE 7 chr7 ch7 7 ## 8 FALSE TRUE FALSE 8 chr8 ch8 8 ## 9 FALSE TRUE FALSE 9 chr9 ch9 9 ## 10 FALSE TRUE FALSE 10 chr10 ch10 10 ## 11 FALSE TRUE FALSE 11 chr11 ch11 11 ## 12 FALSE TRUE FALSE 12 chr12 ch12 12 ## 13 FALSE TRUE FALSE 13 chr13 ch13 13 ## 14 FALSE TRUE FALSE 14 chr14 ch14 14 ## 15 FALSE TRUE FALSE 15 chr15 ch15 15 ## 16 FALSE TRUE FALSE 16 chr16 ch16 16 ## 17 FALSE TRUE FALSE 17 chr17 ch17 17 ## 18 FALSE TRUE FALSE 18 chr18 ch18 18 ## 19 FALSE TRUE FALSE 19 chr19 ch19 19 ## 20 FALSE TRUE FALSE 20 chr20 ch20 20 ## 21 FALSE TRUE FALSE 21 chr21 ch21 21 ## 22 FALSE TRUE FALSE 22 chr22 ch22 22 ## 23 FALSE FALSE TRUE X chrX chX X ## 24 FALSE FALSE TRUE Y chrY chY Y ## 25 TRUE FALSE FALSE MT chrM chMT MT ``` --- ## Changing Chromosome Naming Styles The **seqlevelsStyle()** function allows us to review as well as set the naming convention for our GenomicRanges or TxDb objects. First lets review the genes from the **TxDb.Hsapiens.UCSC.hg19.knownGene** package. Here we see they are **UCSC** style. ```r myGenes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` ``` ## 403 genes were dropped because they have exons located on both strands ## of the same reference sequence or on more than one reference sequence, ## so cannot be represented by a single genomic range. ## Use 'single.strand.genes.only=FALSE' to get all the genes in a ## GRangesList object, or use suppressMessages() to suppress this message. ``` ```r seqlevelsStyle(myGenes) ``` ``` ## [1] "UCSC" ``` ```r myGenes[1:2,] ``` ``` ## GRanges object with 2 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 1 chr19 58858172-58874214 - | 1 ## 10 chr8 18248755-18258723 + | 10 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Changing Chromosome Naming Styles Now we can update them to **Ensembl**, **NCBI** or **dbSNP** styles. Here we update to **Ensembl** chromosome naming conventions. This conversion is essential if we want to use our objects with data generated following Ensembl styles. ```r seqlevelsStyle(myGenes) <- "Ensembl" myGenes[1:2,] ``` ``` ## GRanges object with 2 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 1 19 58858172-58874214 - | 1 ## 10 8 18248755-18258723 + | 10 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Gene Annotation Where as the **TxDb** packages hold information on gene models such as exons positions and exon to gene mapping, the **OrgDb** packages contain information on gene's mapping to alternative IDs and any functional annotation. Information on model organism's gene annotation is contained with the **org.db** packages. Format is org. **species** . **ID type** .db Homo Sapiens annotation with Entrez Gene IDs -- org.Hs.eg.db <div align="center"> <img src="imgs/orgdb.png" alt="igv" height="400" width="400"> </div> --- ## Gene Annotation To gain access to gene annotation we must first load the library for the relevant species. Here we install and load the library for Human, Org.Hs.eg.db. ```r BiocManager::install("org.Hs.eg.db") library(org.Hs.eg.db) class(org.Hs.eg.db) ``` ``` ## [1] "OrgDb" ## attr(,"package") ## [1] "AnnotationDbi" ``` --- ## Org.db AnnotationDbi Accessor Functions As with the TxDb packages, we will need to use special accessors to retrieve information from Org.db. We can use the functions: - **columns()** - Display what kind of annotation is available in OrgDb objects. - **keytypes()** - Displays which type of identifiers can be used with **select** function. - **keys()** - Returns keys (index) for the database contained in the OrgDb object. Used along with **keytypes()** in <b>select</b> function to retrieve interested annotation - **select** - Retrieve the annotation data as a data.frame based on the supplied keys, keytypes and columns. We will explore how to retrieve annotation from gene-centric organism level annotation package (org.Hs.eg.db) using these functions. --- ## Accessing Annotation from Org.Db We can use the **columns()** function first to list available annotation. Here we see our annotation for alternative IDs (ENSEMBL, REFSEQ, GENENAME) as well as functional annotation (GO, PATH, OMIM) ```r columns(org.Hs.eg.db) ``` ``` ## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" ## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" ## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP" ## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" ## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" ## [26] "UNIPROT" ``` -- To know more about the above identifier types ```r help(GENENAME) ``` --- ## Accessing Annotation from org.Hs.eg.db Which keytypes can be used to query this database can be retrieved using keytypes. ```r keytypes(org.Hs.eg.db) ``` ``` ## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" ## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME" ## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP" ## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" ## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" ## [26] "UNIPROT" ``` --- ## Accessing Annotation from Org.Db If we want to extract few identifiers of a particular keytype, we can use keys() function ```r keys(org.Hs.eg.db, keytype="SYMBOL")[1:10] ``` ``` ## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP" ## [7] "SERPINA3" "AADAC" "AAMP" "AANAT" ``` --- ## Accessing Annotation from Org.Db We can extract other annotations for a particular identifier using <b>select</b> function ```r select(org.Hs.eg.db, keys = "A1BG", keytype = "SYMBOL", columns = c("SYMBOL", "GENENAME", "CHR") ) ``` ``` ## SYMBOL GENENAME CHR ## 1 A1BG alpha-1-B glycoprotein 19 ``` --- ## Annotating Results We can use this now to translate our Entrez IDs from TxDb object to a gene name ```r geneLocations <- genes(hg18TxDb) ``` ``` ## 379 genes were dropped because they have exons located on both strands ## of the same reference sequence or on more than one reference sequence, ## so cannot be represented by a single genomic range. ## Use 'single.strand.genes.only=FALSE' to get all the genes in a ## GRangesList object, or use suppressMessages() to suppress this message. ``` ```r geneLocations ``` ``` ## GRanges object with 19742 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 1 chr19 63549984-63565932 - | 1 ## 10 chr8 18293035-18303003 + | 10 ## 100 chr20 42681577-42713790 - | 100 ## 1000 chr18 23784933-24011189 - | 1000 ## 10000 chr1 241718158-242073207 - | 10000 ## ... ... ... ... . ... ## 9991 chr9 114020538-114135733 - | 9991 ## 9992 chr21 34658193-34665310 + | 9992 ## 9993 chr22 17403795-17489967 - | 9993 ## 9994 chr6 90596334-90640876 + | 9994 ## 9997 chr22 49308863-49310900 - | 9997 ## ------- ## seqinfo: 49 sequences (1 circular) from hg18 genome ``` --- ## Annotating Results ```r IDs <- geneLocations$gene_id myTable <- select(org.Hs.eg.db, keys = IDs, keytype = "ENTREZID", columns = c("SYMBOL", "GENENAME", "ENTREZID") ) myTable[1:2,] ``` ``` ## ENTREZID SYMBOL GENENAME ## 1 1 A1BG alpha-1-B glycoprotein ## 2 10 NAT2 N-acetyltransferase 2 ``` --- ## Time for an Exercise [Link_to_exercises](../../exercises/exercises/GenomicFeatures_exercise.html) [Link_to_answers](../../exercises/answers/GenomicFeatures_answers.html)