An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .
Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.
In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.
For more information about the data please visit GDC publication website and read the paper:
A recorded video explaining this workshop is available at: https://youtu.be/3ftZecz0lU4. Also, other workshop videos are available: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt.
Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.
The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.
# to read txt files
library(readr)
# to transform data into GenomicRanges
library(GenomicRanges)
# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
# For the t.test loop
library(plyr)
# For easy volcano plot
library(TCGAbiolinks)
# For heatmap plot
library(ComplexHeatmap)
library(circlize)
# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. It is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).
There are mainly two types of ATAC-Seq Counts Matrices raw and normalized which covers mainly two sets of peaks:
If we check the both sets (Files downloaded from GDC: "All cancer type-specific peak sets. [ZIP]" and "Pan-cancer peak set. [TXT]"), the set of peaks "pan-cancer peak set" consists of $~562K$ peaks, and it contains a subset of each "cancer type-specific peak set". We show an example for Esophageal carcinoma (ESCA) below.
# ESCA specific peaks set
atac_esca <- readr::read_tsv("Data/ESCA_peakCalls.txt", col_types = readr::cols())
head(atac_esca)
# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt", col_types = readr::cols())
head(atac_pan)
# from the pancan set how many belongs to each cancer type?
table(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])
message("How many of the ESCA peaks are the strongest in the pancan")
plyr::count(atac_esca$name %in% atac_pan$name)
plyr::count(grep("ESCA",atac_pan$name,value = T) %in% atac_esca$name)
However, it is important to highlight that the "pan-cancer peak set" will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the "pan-cancer peak set" consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the "ACC_10008" peak, which has a higher score.
"ESCA_17603" %in% atac_pan.gr$name
subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)
subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])
Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp.
unique(width(atac_pan.gr))
unique(width(atac_esca.gr))
In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.
For each set of peaks previously identified, a count matrix was produced. As discribed on the supplemental section "ATAC-seq data analysis – Constructing a counts matrix and normalization":
To obtain the number of independent Tn5 insertions in each peak, first the BAM files were corrected for the Tn5 offset (“+” stranded +4 bp, “-” stranded -5 bp) (16) into a Genomic Ranges object in R using Rsamtools “scanbam”. To get the number of Tn5 insertions per peak, each corrected insertion site (end of a fragment) was counted using “countOverlaps”. This was done for all individual technical replicates and a 562,709 x 796 counts matrix was compiled. From this, a RangedSummarizedExperiment was constructed including peaks as GenomicRanges, a counts matrix, and metadata detailing information for each sample. The counts matrix was then normalized by using edgeR’s “cpm(matrix , log = TRUE, prior.count = 5)” followed by a quantile normalization using preprocessCore’s “normalize.quantiles” in R.
Through the next section we will load the normalized counts data for ESCA and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.
The main file used in this section is included in the following folder.
In the code below, we are showing the beginning of the objects. It is important to highlight
that the samples are using Stanford UUID
instead of TCGA barcodes
and each patient
normally has two replicates.
atac_esca_norm_ct <- readr::read_tsv("Data/ESCA_log2norm.txt", col_types = readr::cols())
atac_esca_norm_ct[1:4,1:8]
We will change the samples names to TCGA barcodes using the file "Lookup table for various TCGA sample identifiers. [TXT]" from GDC. The file path can be found at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG by copying the URL link to the file, readr::read_tsv downloads and reads the table automatically.
#link from gdc.cancer.gov/about-data/publication/ATACseq-AWG, “Lookup table”
gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
head(samples.ids)
Now we will match our Standford UUIDs to TCGA barcodes which might be more meaningful https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/ For that we get the columns names removing non-sample names (seqnames, start, end, name, score) and match them to the bam_prefix column. With the matched index (i.e 1st UUID is the 10th row from samples.ids), we will use it to get the TCGA barcode (Case_ID).
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac_esca_norm_ct)[-c(1:5)]),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]
Now that we have our matrix, we will create a SummarizedExperiment from it. Which will contain 3 matrices, one with the ATAC-seq values, the metadata of the peaks and the samples metadata.
atac <- atac_esca_norm_ct
non.cts.idx <- 1:5
# We will get the samples metadata from GDC using the TCGAbiolinks packages
# we simply need to give the TCGA barcode, and all the metadata available from that sample will be pull out.
# you can check the first sample metadata at GDC
# https://portal.gdc.cancer.gov/cases/f8dbab24-b9f4-4b8a-bfea-57856ccf6364?bioId=3889c9fe-3777-4b3d-9cf1-12135d4d7f7d
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))
# We will raname Squamous cell carcinoma to ESCC and Adenocarcinoma to ESAD (removing other informations)
head(samples.info)[,c("sample","primary_diagnosis")]
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]
Now we have all the data required to create our SummarizedExperiment (SE) object.
# Matrix 1: data
counts <- atac[,-c(1:5)]
head(counts)
# Matrix 2: genomic information
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
rowRanges
# Matrix 3: Samples metadata
# create key for merging
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
colData <- unique(left_join(samples.info,samples.ids))
head(colData)
esca.rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
rowRanges = rowRanges,
colData = DataFrame(colData))
esca.rse
Since we have two samples for each patient we will rename tham as rep1 and rep2
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
colnames(esca.rse)
A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.
escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")
# We will use 2 cores to run the code
library(doParallel)
registerDoParallel(2)
# Time expected ~ 3 min
result <- plyr::adply(assay(esca.rse),.margins = 1,.fun = function(peak){
results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
return(tibble::tibble("raw_p_value"= results$p.value,
"ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2]))
}, .progress = "time", .id = "peak", .parallel = TRUE)
result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")
It is possible to visualize the results of a t.test
as a volcano plot,
which can be used to better select a cut-off for the significant ATAC-Seq peaks.
In this example we will use the $FDR < 0.01$ and $\Delta Log_2 Counts > 2$.
We will be using TCGAbiolinks
package to plot it, but you can do it using ggplot2
or plotly
for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.
fdr.cut.off <- 0.01
diff.cut.off <- 2
TCGAbiolinks:::TCGAVisualize_volcano(x = result$ESCC_minus_ESAD,
y = result$FDR,
title = paste0("Volcano plot - ATAC-seq peaks ",
"difference in ",
"ESCC vs ESAD\n"),
filename = NULL,
label = c("Not Significant",
paste0("High in ESCC (vs ESAD)"),
paste0("Low in ESCC (vs ESAD)")),
ylab = expression(paste(-Log[10],
" (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
)),
xlab = expression(paste(
"Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
)),
x.cut = diff.cut.off,
y.cut = fdr.cut.off)
# How many peaks pass our cut-offs
message(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off))
First, we will load the libraries used to plot the heatmap.
library(ComplexHeatmap)
library(circlize)
# Colors of the heatmap
pal_atac <- colorRampPalette(c('#3361A5',
'#248AF3',
'#14B3FF',
'#88CEEF',
'#C1D5DC',
'#EAD397',
'#FDB31A',
'#E42A2A',
'#A31D1D'))(100)
# Upper track with the samples annotation
ha = HeatmapAnnotation(df = data.frame("Group" = esca.rse$primary_diagnosis,
"Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")),
show_annotation_name = T,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6))
# Select significant peals to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
# Define the color scale
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
# Show the names of the peaks 1 and 18
rows.annot <- rowAnnotation(foo = anno_mark(at = c(1,18), labels = rownames(plot.atac)[c(1,18)]))
# Plot the ATAC-Seq signals
ht_list <-
Heatmap(plot.atac,
name = "ATAC-seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
A better way to visualize a heatmap is using the z-score transformation on the rows. Z-scores are centered and normalized, so the user can interpret a color as x standard deviations from the mean and have an intuitive idea of the relative variation of that value. This will make the visibility of the heatmap better since it will reduce the range of the values plots. For more information, please read the discussion here.
In R the function scale
can be used, since it works by column we have to transpose the matrix so it is applied to the peaks instead of the samples and then transpose it back.
# Start to plot z-score heatmap
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score
# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
ht_list <-
Heatmap(plot.atac.row.z.score,
name = "Row z-score (ATAC-seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
row_title = paste0(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
#column_order = cols.order,
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
#width = unit(15, "cm"),
#column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"),
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ",
fdr.cut.off,", Diff mean log2 Count > ",
diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
If you want to instead of plot all replicates, to plot a single value for each patient you can get the mean of the values.
# This function will calculate the Means of the peaks for a given group
# in our case we will calculate the mean of the replicates of each patient.
groupMeans <- function(mat, groups = NULL, na.rm = TRUE){
stopifnot(!is.null(groups))
gm <- lapply(unique(groups), function(x){
rowMeans(mat[,which(groups == x),drop = F], na.rm=na.rm)
}) %>% Reduce("cbind",.)
colnames(gm) <- unique(groups)
return(gm)
}
matMerged <- groupMeans(mat = assays(esca.rse)$log2norm, groups = colData(esca.rse)$sample)
# keep only metadata for replicate 1
metadata <- colData(esca.rse)[grep("rep1",rownames(colData(esca.rse))),]
# Create the upper annotation track for the samples
ha = HeatmapAnnotation(df = data.frame("Group" = metadata$primary_diagnosis),
show_annotation_name = TRUE,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = TRUE,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6))
# Select the significant peaks to be plotted
plot.atac <- matMerged[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
# Define the color scheme based on the values
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
# Plot ATAC-Seq signal values
ht_list <-
Heatmap(plot.atac,
name = "ATAC-Seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,",
Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
# Start to plot z-score heatmap
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score
# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
ht_list <-
Heatmap(plot.atac.row.z.score,
name = "Row z-score (ATAC-seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")
The ATAC-Seq bigwig files available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
Here is some information about the bigwig files.
All bigWig files for each cancer type are compressed using tar and gzip. As such, each of the .tgz files contains all of the individual bigWig files for each technical replicate.
We recommend extracting the files using the following command: tar -zxvf file_name.tgz --strip-components 8 where the "--strip-components 8" extracts the files without copying their original directory structure
The provided bigWig files have been normalized by the total insertions in peaks and then binned into 100-bp bins. Each 100-bp bin represents the normalized number of insertions that occurred within the corresponding 100 bp.
The bigwig names also use Stanford UUIDs. The script below will help to rename the bigwifiles with TCGA barcodes. First, we get the path to the downloaded bigwig files after uncompressing them and read the information file from the ATAC-Seq website.
bigwig.files <- dir(path = "ATAC-seq_data/ESCA_bigWigs/",
pattern = ".bw",
all.files = T,
full.names = T)
table <- readr::read_tsv("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c")
plyr::a_ply(bigwig.files,1, function(file) {
file.uuid <- stringr::str_extract(file,
"[:alnum:]{8}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{12}")
idx <- grep(file.uuid,gsub("-","_",table$stanfordUUID))
barcode <- unique(table[idx,]$Case_ID)
if(grepl("ESCA",file)){
samples.info <- TCGAbiolinks:::colDataPrepare(barcode)
barcode <- gsub(",| |NOS","",
gsub("Adenocarcinoma","ESAD",
gsub("Squamous cell carcinoma","ESCC",
paste0(samples.info$primary_diagnosis,"-",
samples.info$sample)
)
)
)
}
# change UUID to barcode
to <- gsub(file.uuid,barcode,file)
file.rename(file, to)
})
Since loading several bigWigs might be pretty slow in software like IGV users might want to reduce the bigwig files to a single chromosome (i.e. chr20). The Rscript below can do it by transforming the bigWig to a wig with only chr20 then converting the wig back to bigWig.
You can download at the executable for bigWigToWig
and wigToBigWig
at ENCODE (http://hgdownload.cse.ucsc.edu/admin/exe/)
and the hg38.chrom.sizes is available at GitHub (https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes)
{R}
chr <- 20
dirout <- paste0("chr",chr)
dir.create(dirout)
files <- dir(path = ".",pattern = "bw",full.names = T)
for(f in files){
f.in <- f
f.out <- gsub("bw","wig",f)
f.out.chr <- file.path(dirout,gsub("\\.bw",paste0("_chr",chr,".bw"),f))
cmd <- paste0("bigWigToWig -chrom=chr",chr," ", f.in," ", f.out)
system(cmd)
cmd <- paste0("wigToBigWig ", f.out," hg38.chrom.sizes ", f.out.chr)
system(cmd)
}
We upload some samples (chromosome 20 only) at the google drive
that can be plotted in R using the karyoploteR
package, which main documentation can be found at https://bernatgel.github.io/karyoploter_tutorial/.
# Load required libraries
suppressMessages({
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
})
# Plot parameters, only to look better
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
# Get transcrupts annotation to get HNF4A regions
tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "HNF4A"]
# plot will be at the HNF4A range +- 50Kb
HNF4A.region <- range(c(tssAnnot)) + 50000
# Start by plotting gene tracks
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex = 0.5, plot.params = pp)
genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE)
genes.data <- addGeneNames(genes.data)
genes.data <- mergeTranscripts(genes.data)
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex = 0.5, plot.params = pp)
kpAddBaseNumbers(kp, tick.dist = 20000, minor.tick.dist = 5000,
add.units = TRUE, cex = 0.4, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0, r1 = 0.25, gene.name.cex = 0.5)
# Start to plot bigwig files
big.wig.files <- dir(path = "Data/ESCA_bigwig_chr20/",
pattern = ".bw",
all.files = T,
full.names = T)
big.wig.files
# Reserve area to plot the bigwig files
out.at <- autotrack(1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1)
# Add ATAC-seq label from 0.3 to 1 which should cover all ATAC-seq tracks
kpAddLabels(kp,
labels = "ATAC-Seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1)
for(i in seq_len(length(big.wig.files))) {
bigwig.file <- big.wig.files[i]
# Define where the track will be ploted
# autotrack will simple get the reserved space (from out.at$r0 up to out.at$r1)
# and split in equal sizes for each bigwifile, i the index, will control which
# one is being plotted
at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
# Plot bigwig
kp <- kpPlotBigWig(kp,
data = bigwig.file,
ymax = "visible.region",
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1)
computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
# Add track axis
kpAxis(kp,
ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5)
# Add track label
kpAddLabels(kp,
labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
r0 = at$r0,
r1 = at$r1,
cex = 0.5,
label.margin = 0.01)
}
One of the links identfied using HM450 was the link: cg03326606-PARD6B. Which will be plotted below.
# Load libraries from this section
suppressMessages({
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
})
# karyoploteR options for a better plotting
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
# Get probe genomic ranges to plot
cg03326606 <- ELMER:::getInfiniumAnnotation()["cg03326606"]
cg03326606
tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "PARD6B"]
suppressWarnings({
pair.region <- range(c(tssAnnot,cg03326606)) + 200
kp <- plotKaryotype(zoom = pair.region,genome = "hg38", cex = 0.5, plot.params = pp)
genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE)
genes.data <- addGeneNames(genes.data)
genes.data <- mergeTranscripts(genes.data)
promoter <- promoters(range(genes.data$transcripts$`84612`),downstream = 0)
kp <- plotKaryotype(zoom = pair.region,genome = "hg38", cex = 0.5, plot.params = pp)
kpPlotRegions(kp, toGRanges("chr20:50729030-50729031"), r0=0, r1=0.02, col="#ff8d92")
kpPlotRegions(kp, promoter, r0 = 0, r1 = 0.02, col="#8d9aff")
kpPlotLinks(kp,
data = toGRanges("chr20:50729030-50729031"),
data2 = promoter,
col = "#fac7ffaa",
r0 = 0.02,
arch.height = 0.1)
kpAddBaseNumbers(kp, tick.dist = 10000, minor.tick.dist = 2000,
add.units = TRUE, cex = 0.5, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0.12, r1 = 0.25, gene.name.cex = 0.5)
})
big.wig.files <- dir(path = "Data/ESCA_bigwig_chr20/",
pattern = ".bw",
all.files = T,
full.names = T)
out.at <- autotrack(1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1)
kpAddLabels(kp,
labels = "ATAC-seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1)
for(i in seq_len(length(big.wig.files))) {
bigwig.file <- big.wig.files[i]
at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
kp <- kpPlotBigWig(kp,
data = bigwig.file,
ymax = 523,
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1)
computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
kpAxis(kp,
ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5)
kpAddLabels(kp,
labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
r0 = at$r0,
r1 = at$r1,
cex = 1,
label.margin = 0.01)
}
sessionInfo()
We have a set of recorded videos, explaining some of the workshops.