Overview

  • Set up
  • Raw data to visualisations
    • Raw data to visualisations in R (Quick)
  • Visualising signal over genomic region sets

Materials

All prerequisites, links to material and slides for this course can be found on github.

Or can be downloaded as a zip archive from here.

Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived GenomicHeatmapsAndProfiles folder in the Rstudio menu.

Session -> Set Working Directory -> Choose Directory

or in the console.

setwd("/PathToMyDownload/GenomicHeatmapsAndProfiles/r_course")
## For Fuchs Lab
# e.g. setwd("/Workshop_Data/TestFolder/")

Genomics data visualisations

A common task in genomics is to review your data in your favourite genome browser. Here we are using IGV and our material is available here

offset

Genomics data visualisations

We need to work with aligned data for visualisation and analysis in our genome of choice. Aligned sequence data is often stored in SAM format.

SAM - Sequence Alignment Map

igv

  • Contains read and alignment information including genomic location

Genomics data visualisations

These format are easy for us to read but inefficent for software to work with. When using software we will most often work with highly compressed and indexed equivalent files.

  • SAM – BAM (with a .bai index)
  • wig/bedGraph – bigWig

Full workflows available.

In the next few slides we are going to run through an alignment of ChIP-seq FastQ to sorted and indexed BAM files and then create a bigWig.

For a full summary of this processing please see our relevant courses

A reference genome in R.

We can also extract genomes from Bioconductor objects inside of R. Here we use the BSgenome.Hsapiens.UCSC.hg19 to extract the sequence for major chromosomes and write out to a FASTA file.

require(BSgenome.Hsapiens.UCSC.hg19)
mainChromosomes <- paste0("chr",c(1:22,"X","Y","M"))
mainChrSeq <- lapply(mainChromosomes,
                     function(x)BSgenome.Hsapiens.UCSC.hg19[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
writeXStringSet(mainChrSeqSet,
                "BSgenome.Hsapiens.UCSC.hg19_majorChr.fa")

FastQ to BAM and bigWigs.

Once we have our index we can align our FastQ to the genome using the align function to create a BAM file.

align("BSgenome.Hsapiens.UCSC.hg19",
      "ENCFF001NQP.fastq.gz",
      output_file="ENCFF001NQP.bam",
      type="dna")

FastQ to BAM and bigWigs.

Now we have a indexed BAM file we can easily extract the total mapped reads.

MappedReads <- idxstatsBam("Sorted_ENCFF001NQP.bam")
TotalMappedReads <- sum(MappedReads$mapped)
## [1] 10302212

class: inverse, center, middle

Visualising signal over genomic region sets.


Multipanel review in IGV

In IGV we can also review multiple loci in one screen to get an overview of signal over a group of regions.

offset

ProfilePlyr package

Many requests for complex heatmaps and summary statistics of data visualised within.

.pull-left[ Profileplyr is available from Bioconductor and developed with Doug Barrows to - Allow import/export between Deeptools and R framework - Provide easy tools to manipulate and summarise heatmap data. ] .pull-right[

igv

]

Getting hold of data.

First we need to get hold of some data to plot.

For demonstration we will take a small processed data set of bigWigs from a paper containing ChIP-seq for the transcription factos ZBTB1 and ATf4.

We can retrieve raw and processed data for GSM4332935, GSM4332934, GSM4332940, GSM4332950 and GSM4332944 directly from GEO.

igv

Getting hold of data.

We can use the getGEOSuppFiles function to download any supplementary files to our local working directory.

require(GEOquery)

GSM4332935_BigWig <- getGEOSuppFiles("GSM4332935",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332934_BigWig <- getGEOSuppFiles("GSM4332934",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332944_BigWig <- getGEOSuppFiles("GSM4332944",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332940_BigWig <- getGEOSuppFiles("GSM4332940",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332950_BigWig <- getGEOSuppFiles("GSM4332950",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")

GRanges

The BED file regions are stored in R as a GRanges object.

GRanges objects provide much of the functionality seen in BedTools.

ZBTB1_GR
## GRanges object with 593 ranges and 2 metadata columns:
##         seqnames              ranges strand |                      name
##            <Rle>           <IRanges>  <Rle> |               <character>
##     [1]     chr2   70660347-70663275      * |    chr2:70660347-70663275
##     [2]    chr12     6036980-6042542      * |     chr12:6036980-6042542
##     [3]    chr13 113526647-113528481      * | chr13:113526647-113528481
##     [4]     chr5       881483-884047      * |        chr5:881483-884047
##     [5]    chr17   79397761-79400379      * |   chr17:79397761-79400379
##     ...      ...                 ...    ... .                       ...
##   [589]    chr11     9024476-9026040      * |     chr11:9024476-9026040
##   [590]     chr1   58387303-58388627      * |    chr1:58387303-58388627
##   [591]    chr17   32834453-32835985      * |   chr17:32834453-32835985
##   [592]     chrY   13683928-13693098      * |    chrY:13683928-13693098
##   [593]    chr10   90901936-90905010      * |   chr10:90901936-90905010
##             score
##         <numeric>
##     [1]         0
##     [2]         0
##     [3]         0
##     [4]         0
##     [5]         0
##     ...       ...
##   [589]         0
##   [590]         0
##   [591]         0
##   [592]         0
##   [593]         0
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Overlaps in GRanges

The function %over% allows us to evaluate which regions overlap/intersect between two region sets.

  • A[A %over% B] – Returns all regions in A which overlap regions in B
  • B[B %over% A] – Returns all regions in B which overlap regions in A
  • A[!A %over% B] – Returns all regions in A which do not overlap regions in B

Heatmaps in Deeptools

The DeepTools software provides a set of tools to convert between file types (i.e. BAM to bigWig) and to plot signal from BAM or bigWig over a set of regions as a heatmap.

Deeptools course for more information

igv

Creating GenomicHeatmaps

computeMatrix reference-point --referencePoint center 
-bs binSize -a BP_after_BED -b BP_before_BED 
-S  BIGWIGSofINTEREST 
--regionsFileName BEDFILEOFINTEREST 
--outFileName OUTFILE

The computeMatrix function take a set of arguments to generate an intermediate file

  • reference-point –referencePoint = Specify position in regions in BED file to centre heatmap.
  • -bs = Binsize across regions in BED file
  • -b and -a = bp before and after reference point to plot in heatmap.
  • -S = BigWigs to plot signal from
  • –regionsFileName = BED file containing regions to plot over.
  • –outFileName = Name of intermediate file generated.

Creating GenomicHeatmaps

Once we have our intermediate file we can use this to create our heatmap using the plotHeatmap command.

plotHeatmap -m MATRIXFILE 
-o HEATMAP
--colorList lowColour,highColour 
  • -m = Name of intermediate file generated by Deeptools computeMatrix.
  • -o = Output file name for heatmap
  • –colorList = Bottom and top of colour scale for heatmap.

Creating GenomicHeatmaps

The profileplyr package offers a similar set of functionality as the Deeptools software within R. To compute our intermediate matrix we can use the BamBigwig_to_chipProfile function with the same set of inputs as Deeptools.

require(profileplyr)
bigWigs <- dir(full.names=TRUE,pattern = "FLAG")
zbtb1_profile <- BamBigwig_to_chipProfile(bigWigs,
                                   testRanges = "data/beds/ZBTB1Peaks.bed",
                                   style = "point",
                                   format = "bigwig",
                                   distanceAround = 2000)

zbtb1_profile <- as_profileplyr(zbtb1_profile)
zbtb1_profile
## class: profileplyr 
## dim: 593 200 
## metadata(0):
## assays(4): GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw
##   GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw
##   GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw
##   GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw
## rownames(593): giID450 giID451 ... giID562 giID563
## rowData names(5): name score sgGroup giID names
## colnames: NULL
## colData names(0):

Import from DeepTools

However you generate the intermediate file, you can import/export to Deeptools. This allows you take advantage of any functionality in Deeptools and R together.

zbtb1_deeptools <- import_deepToolsMat("data/Deeptools_ZBTB1Peaks.MAT")
generateEnrichedHeatmap(zbtb1_deeptools)

Blacklist

ChIP-seq will often show the presence of common artefacts such as ultra-high signal regions or Blacklists. Such regions can confound peak calling, fragment length estimation and QC metrics.

For more information on Blacklists see the Deeptools guide or our paper.

igv

Creating GenomicHeatmaps

The groupBy by function allows us to add additional information to our Heatmaps. Here we provide the GRanges of our Blacklist BED file and specify to include sites which do not overlap using the include_nonoverlapping argument.

zbtb1_profile_Grouped <- groupBy(zbtb1_profile,
                                 group = blacklist,
                                 include_nonoverlapping = TRUE)
generateEnrichedHeatmap(zbtb1_profile_Grouped)

Creating GenomicHeatmaps

Now we have created our filtered set we could export it back to Deeptools to plot using it’s plotHeatmap function.

export_deepToolsMat(zbtb1_profile_bl,
                    con="zbtb1_Profileplyr.MAT",
                    overwrite = TRUE)

Subsetting

We can subset our heatmap to only review the samples we want using indexing in R.

Our heatmaps have 3 dimension - Ranges/Rows, Bins/Columns, Samples. To subset samples then we can index our profileplyr object as in standard R.

zbtb1_profile_subset <- zbtb1_profile_bl[,,3:4]
generateEnrichedHeatmap(zbtb1_profile_subset)

Joining samples

We can then join our ZBTB1 and ATF4 bigwigs using the common R combine function c().

ztbtb1Andatf4_profile <- c(zbtb1_profile_subset,atf4_profile_bl)
generateEnrichedHeatmap(ztbtb1Andatf4_profile)

Updating sample names

We can directly change the sample names when plotting the heatmap too by using the sample_names argument.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        sample_names=c("zbtb_A","zbtb_B","atf4_A"))

Updating sample information

We can use the information on samples then when we plot our heatmap. Here we tell the generateEnrichedHeatmap() function to colour our heatmaps by their Antibody information we just added.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody")

Clustering ranges by summarised signal

We can now review the clustering by simply passing the new profileplyr object to the generateEnrichedHeatmap function.

We can see that it has created two clusters which appear to contain high or low ATF4 signal.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody")

Joining samples

If we wish to now review the relationship between the clustering and the GRanges group we can use the extra_annotation_columns argument to specify additional information to add to the Heatmap.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody",
                        extra_annotation_columns = "cluster")

Plotting

Now we have our long format data.frame ready we can pass this to ggplot to visualise as violin plots.

require(ggplot2)
ggplot(ztbtb1Andatf4_summarized,
       aes(y=Signal,x=Sample,fill=GR_overlap_names))+
  facet_wrap(Antibody~.,scales = "free")+
  geom_violin()+scale_y_log10()

Annotating to genes

All the annotation is now contained in the row information for the profileplyr object and can be retieved using the rowRanges() function.

annotatedRanges <- rowRanges(ztbtb1Andatf4_profile)
annotatedRanges[1,]
## GRanges object with 1 range and 13 metadata columns:
##           seqnames          ranges strand |                 name     score
##              <Rle>       <IRanges>  <Rle> |          <character> <numeric>
##   giID484     chr1 1508087-1512087      + | chr1:1508899-1511276         0
##                  sgGroup        giID                names  cluster
##                 <factor> <character>          <character> <factor>
##   giID484 ZBTB1Peaks.bed     giID484 chr1:1508899-1511276        1
##           hierarchical_order overlap_matrix GR_overlap_names      name.1
##                    <integer>       <matrix>         <factor> <character>
##   giID484                252              0       no_overlap        <NA>
##             score.1      SYMBOL distanceToTSS
##           <numeric> <character>     <numeric>
##   giID484        NA       SSU72           162
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Annotating to genes

We can use the annotation to label where peaks closest to our genes of interest are in the heatmap

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody",
                        extra_annotation_columns = "cluster",
                        genes_to_label = "ASNS",
                        gene_label_font_size = 18)

Grouping by genes

Now we have named list of the genes of interest, we can use this group our heatmap using the groupBy function and produce our final heatmap.

Here the group argument accepts a list of gene names instead of a GRanges.

ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile,
                                 group = aa_meta_Genes,
                                 include_nonoverlapping = TRUE)

Grouping by genes

The overlap is quite small so we regroup by a different column and add the gene set as annotation instead by specify “GL_overlap_names” to the extra_annotation_columns parameter

ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile2,group="cluster")
generateEnrichedHeatmap(ztbtb1Andatf4_profile2,
                        color_by_sample_group = "Antibody",genes_to_label = "ASNS",
                        extra_annotation_columns = "GL_overlap_names",
                        gene_label_font_size = 18)

Contact

Any suggestions, comments, edits or questions (about content or the slides themselves) please reach out to our GitHub and raise an issue.