Introduction¶

The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and can reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution.

Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol able to generate chromatin accessibility profiles from archival frozen tissue samples. It was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). Integrating this data with other TCGA multi-omic data led to numerous discoveries that improve our understanding of the noncoding genome in cancer, with the aim of advancing diagnosis and therapy.

Workshop description¶

In this workshop, we present the TCGA ATAC-seq data in more detail. The data is publicly available through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and we demonstrate how it can be analyzed within the R/Bioconductor environment.

For more information about the data, please visit the GDC publication website and read the paper:

CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898

Workshop video¶

A recorded video explaining this workshop is available at: https://youtu.be/3ftZecz0lU4. Other workshop videos are available at: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt.

Pre-requisites¶

Basic knowledge of R syntax

Workshop Participation¶

Students will have a chance to download cancer-specific ATAC-Seq peaks from GDC and import them into R. An esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is then performed, and the results are visualized as a volcano plot and a heatmap.

Goals and objectives¶

Download and understand the ATAC-Seq data
Compare the ATAC-Seq data of two different groups of samples

Environment: R libraries¶

The code below loads all the R libraries required for the workshop. Their versions are listed in the session information section.

# to read txt files
library(readr)

# to transform data into GenomicRanges
library(GenomicRanges)

# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)

# For the t.test loop
library(plyr)

# For easy volcano plot
library(TCGAbiolinks)

# For heatmap plot
library(ComplexHeatmap)
library(circlize)

# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb

Attaching package: ‘tidyr’

The following object is masked from ‘package:S4Vectors’:

    expand


Attaching package: ‘dplyr’

The following objects are masked from ‘package:GenomicRanges’:

    intersect, setdiff, union

The following object is masked from ‘package:GenomeInfoDb’:

    intersect

The following objects are masked from ‘package:IRanges’:

    collapse, desc, intersect, setdiff, slice, union

The following objects are masked from ‘package:S4Vectors’:

    first, intersect, rename, setdiff, setequal, union

The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

The following object is masked from ‘package:dplyr’:

    count

Loading required package: BiocParallel

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply

------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: ‘plyr’

The following object is masked from ‘package:matrixStats’:

    count

The following objects are masked from ‘package:dplyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from ‘package:IRanges’:

    desc

The following object is masked from ‘package:S4Vectors’:

    rename

Loading required package: grid
========================================
ComplexHeatmap version 2.1.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.
========================================

========================================
circlize version 0.4.8
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: http://jokergoo.github.io/circlize_book/book/

If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization 
  in R. Bioinformatics 2014.
========================================

Loading required package: regioneR
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi

Attaching package: 'AnnotationDbi'

The following object is masked from 'package:dplyr':

    select

Data¶

The files used in this workshop are available on Google Drive. They contain some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.

Understanding the data: peaks sets¶

The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. Note that this data has been aligned against the Homo sapiens (human) genome assembly GRCh38 (hg38).

There are two main types of ATAC-Seq count matrices, raw and normalized, and they cover two main sets of peaks:

"cancer type-specific peak set" containing all of the reproducible peaks observed in an individual cancer type. These peaks were observed in at least two samples with a score per million value $\geq 5$
"pan-cancer peak set" representing reproducible peaks from all cancer types that could then be used for cross-cancer comparisons

Comparing pan-cancer peak set and cancer type-specific peak set¶

If we check both sets (files downloaded from GDC: "All cancer type-specific peak sets. [ZIP]" and "Pan-cancer peak set. [TXT]"), the "pan-cancer peak set" consists of $\sim 562\mathrm{K}$ peaks and contains a subset of each "cancer type-specific peak set". We show an example for esophageal carcinoma (ESCA) below.

# ESCA specific peaks set
atac_esca <- readr::read_tsv("Data/ESCA_peakCalls.txt", col_types = readr::cols())
head(atac_esca)

# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt", col_types = readr::cols())
head(atac_pan)

# From the pan-cancer set, how many peaks are named after each cancer type?
table(stringr::str_split(atac_pan$name,"_",simplify = TRUE)[,1])

message("How many of the ESCA peaks are the strongest in the pancan")
plyr::count(atac_esca$name %in% atac_pan$name)

plyr::count(grep("ESCA",atac_pan$name,value = T) %in% atac_esca$name)

  ACC  BLCA  BRCA  CESC  CHOL  COAD  ESCA   GBM  HNSC  KIRC  KIRP   LGG  LIHC 
29311 25337 49748 14358 11819 25404 13237 15394 16651 15067 24324 23836 35787 
 LUAD  LUSC  MESO  PCPG  PRAD  SKCM  STAD  TGCT  THCA  UCEC 
23729 22195 22958 31372 30067 36591 17358 26120 31568 20478

How many of the ESCA peaks are the strongest in the pancan
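The per-cancer-type count above relies on the peak names following a "cancer type, underscore, number" pattern. As a minimal sketch, assuming a handful of made-up peak names (they are not taken from the real files), the same idea can be reproduced in base R:

```r
# Toy peak names in the "<cancer type>_<peak number>" style used by the peak sets
# (these particular names are invented for illustration)
peak_names <- c("ACC_1", "ACC_2", "ESCA_10", "BRCA_7", "ESCA_11", "ESCA_12")

# Keep everything before the first underscore: the cancer type prefix
cancer_type <- sub("_.*$", "", peak_names)

# Count how many peaks carry each prefix
type_counts <- table(cancer_type)
print(type_counts)
```

Here `sub()` plays the role that `stringr::str_split()` plays in the workshop code; either approach should work as long as the prefix itself contains no underscore.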

However, it is important to highlight that, for overlapping peaks, the "pan-cancer peak set" keeps only the most significant peak (highest score). In other words, each name in the "pan-cancer peak set" is the name of the cancer-specific peak with the highest score. If we check the overlap between the ESCA peaks and the pan-cancer regions, we can see that the majority of the ESCA peaks are still covered by the pan-cancer set, but their signal is higher in another cancer type.

atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
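To make the "keep the highest-scoring peak among overlapping peaks" idea concrete, here is a simplified sketch in base R. It is an illustration under stated assumptions, not the procedure used to build the published pan-cancer set: the coordinates, scores and the helper `keep_best_overlapping()` are all hypothetical, and only one chromosome is considered.

```r
# Toy peaks on one chromosome (hypothetical coordinates and scores)
peaks <- data.frame(
  name  = c("ESCA_1", "ACC_1", "BRCA_1", "ESCA_2"),
  start = c(100, 110, 700, 1500),
  end   = c(601, 611, 1201, 2001),
  score = c(6.3, 22.0, 9.5, 4.1),
  stringsAsFactors = FALSE
)

# Greedy selection: visit peaks from highest to lowest score and keep a peak
# only if it does not overlap a peak that was already kept
keep_best_overlapping <- function(df) {
  df <- df[order(df$score, decreasing = TRUE), ]
  kept <- df[0, ]
  for (i in seq_len(nrow(df))) {
    overlaps_kept <- any(df$start[i] <= kept$end & df$end[i] >= kept$start)
    if (!overlaps_kept) kept <- rbind(kept, df[i, ])
  }
  kept[order(kept$start), ]
}

merged <- keep_best_overlapping(peaks)
print(merged$name)
```

In this toy example "ESCA_1" overlaps the stronger "ACC_1" and is therefore dropped, while "ESCA_2" has no competitor and survives. This mirrors why an ESCA peak can be covered by the pan-cancer set without its name appearing there.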

Checking an overlapping peak¶

So, we will check an overlapping peak. The peak named "ESCA_17603" is not within the pan-cancer set of peaks, because it overlaps the "ACC_10008" peak, which has a higher score.

"ESCA_17603" %in% atac_pan.gr$name

subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)
subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])

GRanges object with 1 range and 4 metadata columns:
      seqnames              ranges strand |        name       score  annotation
         <Rle>           <IRanges>  <Rle> | <character>   <numeric> <character>
  [1]     chr2 112541661-112542162      * |   ACC_10008 22.03057903    Promoter
       percentGC
       <numeric>
  [1] 0.55489022
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

GRanges object with 1 range and 5 metadata columns:
      seqnames              ranges strand |        name            score
         <Rle>           <IRanges>  <Rle> | <character>        <numeric>
  [1]     chr2 112541649-112542150      * |  ESCA_17603 6.25798951741993
       annotation        percentGC        percentAT
      <character>        <numeric>        <numeric>
  [1]    Promoter 0.55688622754491 0.44311377245509
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Peaks size¶

Also, it is important to note that all peaks have the same size. Each peak is 502 bp wide.

unique(width(atac_pan.gr))
unique(width(atac_esca.gr))
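As a quick check of where the 502 bp figure comes from, the width of a range with inclusive coordinates is end - start + 1. The sketch below applies that arithmetic to the coordinates of the two peaks printed in the previous section, using plain numbers instead of GRanges objects:

```r
# Coordinates of the two peaks printed earlier in this workshop
pan_start  <- 112541661; pan_end  <- 112542162   # ACC_10008
esca_start <- 112541649; esca_end <- 112542150   # ESCA_17603

# Inclusive interval width: both the start and the end position are counted
pan_width  <- pan_end - pan_start + 1
esca_width <- esca_end - esca_start + 1
print(c(pan = pan_width, esca = esca_width))
```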

Summary¶

In summary, the ESCA-named peaks in the pan-cancer set are the ones with the strongest signal in the ESCA samples compared to the other cancer types, and they are a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples, the cancer-specific set should be used.

Analysis: Using ATAC-Seq counts to compare two groups¶

For each set of peaks previously identified, a count matrix was produced, as described in the supplemental section "ATAC-seq data analysis – Constructing a counts matrix and normalization":

To obtain the number of independent Tn5 insertions in each peak, first the BAM files were corrected for the Tn5 offset (“+” stranded +4 bp, “-” stranded -5 bp) (16) into a Genomic Ranges object in R using Rsamtools “scanbam”. To get the number of Tn5 insertions per peak, each corrected insertion site (end of a fragment) was counted using “countOverlaps”. This was done for all individual technical replicates and a 562,709 x 796 counts matrix was compiled. From this, a RangedSummarizedExperiment was constructed including peaks as GenomicRanges, a counts matrix, and metadata detailing information for each sample. The counts matrix was then normalized by using edgeR’s “cpm(matrix , log = TRUE, prior.count = 5)” followed by a quantile normalization using preprocessCore’s “normalize.quantiles” in R.
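The two normalization steps quoted above can be sketched in base R on a toy matrix. This is a rough illustration, not the authors' pipeline: the counts are invented, `log2_cpm()` and `quantile_normalize()` are hypothetical helpers, the log-CPM formula follows the library-size-scaled prior count described in the edgeR documentation with library sizes assumed to be plain column sums, and ties are handled more crudely than in preprocessCore.

```r
# Toy raw counts: 5 peaks x 3 samples (made-up numbers)
counts <- matrix(c(10, 0, 25, 3, 12,
                   20, 1, 40, 8, 31,
                    5, 2, 11, 0,  7),
                 nrow = 5,
                 dimnames = list(paste0("peak", 1:5), paste0("s", 1:3)))

# log2 counts per million with a prior count scaled by library size
log2_cpm <- function(mat, prior.count = 5) {
  lib.size <- colSums(mat)
  prior.scaled <- prior.count * lib.size / mean(lib.size)
  adj.lib.size <- lib.size + 2 * prior.scaled
  log2(t((t(mat) + prior.scaled) / adj.lib.size) * 1e6)
}

# Quantile normalization: every sample receives the same set of values
# (the mean of the sorted columns), assigned according to within-sample rank
quantile_normalize <- function(mat) {
  ranks <- apply(mat, 2, rank, ties.method = "first")
  reference <- rowMeans(apply(mat, 2, sort))
  out <- apply(ranks, 2, function(r) reference[r])
  dimnames(out) <- dimnames(mat)
  out
}

normalized <- quantile_normalize(log2_cpm(counts))
print(round(normalized, 2))
```

After quantile normalization every column contains exactly the same values in a different order, which is what makes samples with different sequencing depth comparable.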

In the next section we will load the normalized counts data for ESCA and compare two groups of samples to identify which peaks are stronger in one group than in the other.

The main file used in this section is included in the following folder.

All cancer type-specific count matrices in normalized counts. [ZIP]

In the code below, we show the beginning of the objects. It is important to highlight that the samples are identified by Stanford UUIDs instead of TCGA barcodes and that each patient normally has two replicates.

atac_esca_norm_ct <- readr::read_tsv("Data/ESCA_log2norm.txt", col_types = readr::cols())
atac_esca_norm_ct[1:4,1:8]

We will change the sample names to TCGA barcodes using the file "Lookup table for various TCGA sample identifiers. [TXT]" from GDC. The file URL can be found at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG by copying the link to the file. readr::read_tsv then downloads and reads the table automatically.

# Link from gdc.cancer.gov/about-data/publications/ATACseq-AWG, "Lookup table"
gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
head(samples.ids)

Now we will match our Stanford UUIDs to TCGA barcodes, which might be more meaningful (https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/). To do that, we take the column names, excluding the non-sample columns (seqnames, start, end, name, score), and match them to the bam_prefix column. The matched index (e.g., the 1st UUID is the 10th row of samples.ids) is then used to get the TCGA barcode (Case_ID).

colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac_esca_norm_ct)[-c(1:5)]),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]
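The renaming step above packs three operations into one line. As a minimal sketch, assuming a toy lookup table whose identifiers and barcodes are invented for illustration, the same logic can be unpacked as follows:

```r
# Hypothetical column names as they might appear in a counts file
# (underscores instead of dashes) and a toy lookup table
count_cols <- c("id_aaa_01", "id_bbb_02", "id_ccc_03")
lookup <- data.frame(
  bam_prefix = c("id-ccc-03", "id-aaa-01", "id-bbb-02"),
  Case_ID    = c("TCGA-XX-0003-01A", "TCGA-XX-0001-01A", "TCGA-XX-0002-01A"),
  stringsAsFactors = FALSE
)

# 1) make the column names look like bam_prefix,
# 2) find the row of each column name in the lookup table,
# 3) pull the barcode stored in that row
idx <- match(gsub("_", "-", count_cols), lookup$bam_prefix)
barcodes <- lookup$Case_ID[idx]
print(idx)
print(barcodes)
```

Note that `match()` returns `NA` for any name missing from the lookup table, so checking `anyNA(idx)` before overwriting column names is a cheap safeguard.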

Now that we have our matrix, we will create a SummarizedExperiment from it, which will contain three components: the ATAC-seq values, the metadata of the peaks, and the metadata of the samples.

atac <- atac_esca_norm_ct
non.cts.idx <- 1:5

# We will get the samples metadata from GDC using the TCGAbiolinks package.
# We simply need to give the TCGA barcode, and all the metadata available for that sample will be pulled out.
# You can check the first sample metadata at GDC:
# https://portal.gdc.cancer.gov/cases/f8dbab24-b9f4-4b8a-bfea-57856ccf6364?bioId=3889c9fe-3777-4b3d-9cf1-12135d4d7f7d
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))

# We will rename Squamous cell carcinoma to ESCC and Adenocarcinoma to ESAD (removing other information)
head(samples.info)[,c("sample","primary_diagnosis")]
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]

Starting to add information to samples
 => Add clinical information to samples
Add FFPE information. More information at: 
=> https://cancergenome.nih.gov/cancersselected/biospeccriteria 
=> http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html
 => Adding subtype information to samples
esca subtype information from:doi:10.1038/nature20805

Now we have all the data required to create our SummarizedExperiment (SE) object.

# Matrix 1: data
counts <- atac[,-c(1:5)]
head(counts)

# Matrix 2: genomic information
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
rowRanges

GRanges object with 126935 ranges and 2 metadata columns:
                                       seqnames              ranges strand |
                                          <Rle>           <IRanges>  <Rle> |
               ESCA_1_chr1_10180_10679     chr1         10180-10679      * |
             ESCA_2_chr1_180641_181140     chr1       180641-181140      * |
             ESCA_3_chr1_181193_181692     chr1       181193-181692      * |
             ESCA_4_chr1_184246_184745     chr1       184246-184745      * |
             ESCA_5_chr1_267755_268254     chr1       267755-268254      * |
                                   ...      ...                 ...    ... .
  ESCA_126940_chrX_155880541_155881040     chrX 155880541-155881040      * |
  ESCA_126941_chrX_155881050_155881549     chrX 155881050-155881549      * |
  ESCA_126942_chrX_155963006_155963505     chrX 155963006-155963505      * |
  ESCA_126943_chrX_155985136_155985635     chrX 155985136-155985635      * |
  ESCA_126944_chrX_156030008_156030507     chrX 156030008-156030507      * |
                                                  score        name
                                              <numeric> <character>
               ESCA_1_chr1_10180_10679 2.08422038177439      ESCA_1
             ESCA_2_chr1_180641_181140 2.16689026248904      ESCA_2
             ESCA_3_chr1_181193_181692  6.6338010584359      ESCA_3
             ESCA_4_chr1_184246_184745 2.58792851900195      ESCA_4
             ESCA_5_chr1_267755_268254 4.97049203476181      ESCA_5
                                   ...              ...         ...
  ESCA_126940_chrX_155880541_155881040 2.01910337344523 ESCA_126940
  ESCA_126941_chrX_155881050_155881549 15.5822634683249 ESCA_126941
  ESCA_126942_chrX_155963006_155963505 4.76520941542765 ESCA_126942
  ESCA_126943_chrX_155985136_155985635 4.08426076220543 ESCA_126943
  ESCA_126944_chrX_156030008_156030507 2.29289618413069 ESCA_126944
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

# Matrix 3: Samples metadata
# create key for merging
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
colData <- unique(left_join(samples.info,samples.ids))
head(colData)

Joining, by = "sample"

esca.rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
                                 rowRanges = rowRanges, 
                                 colData = DataFrame(colData))
esca.rse

class: RangedSummarizedExperiment 
dim: 126935 33 
metadata(0):
assays(1): log2norm
rownames(126935): ESCA_1_chr1_10180_10679 ESCA_2_chr1_180641_181140 ...
  ESCA_126943_chrX_155985136_155985635
  ESCA_126944_chrX_156030008_156030507
rowData names(2): score name
colnames(33): ESCC-TCGA-IG-A51D-01A ESCC-TCGA-IG-A51D-01A ...
  ESAD-TCGA-M9-A5M8-01A ESAD-TCGA-M9-A5M8-01A
colData names(149): sample patient ... Case_UUID Case_ID

Since we have two samples for each patient, we will rename them as rep1 and rep2.

duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
colnames(esca.rse)
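The replicate labelling relies on `duplicated()`, which flags every occurrence of a value after the first. A minimal sketch on invented sample names shows the behaviour, together with a more general alternative based on `ave()` that would also number a third or fourth replicate (an extension that is not needed for this data set):

```r
# Toy sample names: two patients with two replicates, one with a single sample
sample_names <- c("ESCC-A", "ESCC-A", "ESAD-B", "ESAD-B", "ESAD-C")

# Same idea as above: first occurrence -> rep1, repeated occurrence -> rep2
is_repeat <- duplicated(sample_names)
labelled <- ifelse(is_repeat,
                   paste0(sample_names, "_rep2"),
                   paste0(sample_names, "_rep1"))
print(labelled)

# More general variant: number the occurrences within each sample name
general <- paste0(sample_names, "_rep",
                  ave(seq_along(sample_names), sample_names, FUN = seq_along))
print(general)
```

With at most two replicates per sample both approaches give the same labels. With three or more, the `duplicated()` version would produce repeated "_rep2" names.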

Comparing ESCC vs ESAD ATAC-Seq¶

A t-test will be used to identify the peaks whose mean counts differ significantly between the ESCC and ESAD samples.

escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")

# We will use 2 cores to run the code
library(doParallel)
registerDoParallel(2)

# Time expected ~ 3 min
result <- plyr::adply(assay(esca.rse),.margins = 1,.fun = function(peak){
  # conf.level expects a confidence level between 0 and 1 (0.95 is the default), not a logical
  results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = 0.95)
  return(tibble::tibble("raw_p_value"= results$p.value,
                        "ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2]))
}, .progress = "time", .id = "peak", .parallel = TRUE)
result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")

Loading required package: foreach
Loading required package: iterators
Progress disabled when using parallel plyr
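For readers who want to try the per-peak test without the full data set, here is a small self-contained sketch in base R. Everything in it is simulated: the matrix, the two groups and the shifted peaks are assumptions made for illustration, and `lapply()` stands in for the parallel `plyr::adply()` call used above.

```r
set.seed(1)

# Toy log2 matrix: 50 peaks x 12 samples, 6 samples per group (simulated values)
mat <- matrix(rnorm(50 * 12, mean = 3), nrow = 50,
              dimnames = list(paste0("peak", 1:50), NULL))
group_a <- 1:6
group_b <- 7:12

# Make the first five peaks truly different between the groups
mat[1:5, group_a] <- mat[1:5, group_a] + 4

# One Welch two-sample t-test per peak (row)
toy_result <- do.call(rbind, lapply(rownames(mat), function(p) {
  tt <- t.test(mat[p, group_a], mat[p, group_b])
  data.frame(peak = p,
             raw_p_value = tt$p.value,
             A_minus_B = unname(tt$estimate[1] - tt$estimate[2]))
}))

# Correct for multiple testing (Benjamini-Hochberg false discovery rate)
toy_result$FDR <- p.adjust(toy_result$raw_p_value, method = "fdr")
head(toy_result[order(toy_result$FDR), ])
```

The FDR column is never smaller than the raw p-value, which is why cut-offs should be applied to the adjusted values when thousands of peaks are tested at once.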

Volcano plot of t-test analysis¶

It is possible to visualize the results of a t-test as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks. In this example we will use $\mathrm{FDR} < 0.01$ and $|\Delta \log_2 \mathrm{Counts}| > 2$.

We will be using the TCGAbiolinks package to plot it, but you can also do it using ggplot2, or plotly for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.

fdr.cut.off <- 0.01
diff.cut.off <- 2

TCGAbiolinks:::TCGAVisualize_volcano(x = result$ESCC_minus_ESAD,
                                     y = result$FDR, 
                                     title =  paste0("Volcano plot - ATAC-seq peaks ",
                                                     "difference in ", 
                                                     "ESCC vs ESAD\n"),
                                     filename = NULL,
                                     label =  c("Not Significant",
                                                paste0("High in ESCC (vs ESAD)"),
                                                paste0("Low in ESCC (vs ESAD)")),
                                     # bquote() replaces .(x) with the value of x; expression() would print the variable name
                                     ylab =  bquote(paste(-Log[10],
                                                          " (FDR) [two tailed t-test] - cut-off FDR < ", .(fdr.cut.off)
                                     )),
                                     xlab =  bquote(paste(
                                       "Log2(Counts) difference - cut-off log2 delta(cts) > ", .(diff.cut.off)
                                     )),
                                     x.cut = diff.cut.off, 
                                     y.cut = fdr.cut.off)

# How many peaks pass our cut-offs
message(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off))

4067
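The three classes shown in the volcano plot follow directly from the two cut-offs. As a minimal sketch on a made-up results table (the values and the names `fdr_cut` and `diff_cut` are illustrative, chosen so they do not overwrite the workshop variables), the classification can be written as:

```r
# Toy results table (invented values)
toy <- data.frame(
  peak = paste0("peak", 1:6),
  diff = c(3.1, -2.5, 0.4, 2.2, -3.0, 1.9),
  FDR  = c(0.001, 0.004, 0.0001, 0.2, 0.009, 0.005)
)
fdr_cut  <- 0.01
diff_cut <- 2

# A peak is called only if it passes BOTH the FDR and the difference cut-off
toy$status <- "Not Significant"
toy$status[toy$FDR < fdr_cut & toy$diff >  diff_cut] <- "High in ESCC (vs ESAD)"
toy$status[toy$FDR < fdr_cut & toy$diff < -diff_cut] <- "Low in ESCC (vs ESAD)"
print(table(toy$status))
```

Note that a peak with a very small FDR but a small difference (peak3) and a peak with a large difference but a poor FDR (peak4) both stay in the "Not Significant" class.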

Heatmap of differential significant peaks¶

First, we will load the libraries used to plot the heatmap.

library(ComplexHeatmap)
library(circlize)

# Colors of the heatmap
pal_atac <- colorRampPalette(c('#3361A5',
                               '#248AF3',
                               '#14B3FF',
                               '#88CEEF',
                               '#C1D5DC',
                               '#EAD397',
                               '#FDB31A',
                               '#E42A2A',
                               '#A31D1D'))(100)

# Upper track with the samples annotation
ha = HeatmapAnnotation(df = data.frame("Group" = esca.rse$primary_diagnosis, 
                                       "Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")),
                       show_annotation_name = T,
                       col = list(Group = c("Squamous cell carcinoma, NOS" =  "red", 
                                            "Adenocarcinoma, NOS" = "blue")),
                       show_legend = T,
                       annotation_name_side = "left",
                       annotation_name_gp = gpar(fontsize = 6))

# Select significant peaks to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]

# Define the color scale
col <- colorRamp2(seq(min(plot.atac), max(plot.atac), 
                      by = (max(plot.atac) - min(plot.atac))/99), pal_atac)

# Show the names of the peaks 1 and 18
rows.annot <- rowAnnotation(foo = anno_mark(at = c(1,18), labels = rownames(plot.atac)[c(1,18)]))


# Plot the ATAC-Seq signals
ht_list <- 
  Heatmap(plot.atac,
          name = "ATAC-seq log2(counts)", 
          col = col,
          column_names_gp = gpar(fontsize = 8),
          show_column_names = F,
          heatmap_legend_param = list(legend_direction = "horizontal",
                                      labels_gp = gpar(fontsize = 12), 
                                      title_gp = gpar(fontsize = 12)),
          show_row_names = FALSE,
          cluster_columns = TRUE,
          use_raster = TRUE,
          raster_device = c("png"),
          raster_quality = 2,
          cluster_rows = T,
          right_annotation = rows.annot,
          row_title = paste0(sum(result$FDR < fdr.cut.off & 
                                   abs(result$ESCC_minus_ESAD) > diff.cut.off),
                             " ATAC-seq peaks"),
          row_names_gp = gpar(fontsize = 4),
          top_annotation = ha,
          column_title_gp = gpar(fontsize = 12), 
          row_title_gp = gpar(fontsize = 12)) 

options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE, 
     column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
                           ",  Diff mean log2 Count > ",diff.cut.off,")"),
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "bottom",
     annotation_legend_side = "right")

Heatmap of differential significant peaks (z-score)¶

A better way to visualize a heatmap is to apply a z-score transformation to the rows. Z-scores are centered and scaled, so a color can be read as the number of standard deviations from the row mean, which gives an intuitive idea of the relative variation of that value. This also makes the heatmap easier to read, since it reduces the range of the plotted values. For more information, please read the discussion here.

In R, the function scale can be used. Since it works by column, we have to transpose the matrix so that it is applied to the peaks instead of the samples, and then transpose the result back.
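The double transposition can be verified on a toy matrix. The sketch below uses invented values and simply checks that, after the transformation, every row (peak) has mean 0 and standard deviation 1:

```r
# Toy matrix: 3 peaks (rows) x 4 samples (columns), made-up values
m <- matrix(c( 1,  2,  3,  4,
              10, 20, 30, 40,
               5,  5,  6,  8),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("peak", 1:3), paste0("s", 1:4)))

# scale() standardizes columns, so transpose, scale, and transpose back
row_z <- t(scale(t(m)))

print(round(rowMeans(row_z), 10))  # row means after scaling
print(apply(row_z, 1, sd))         # row standard deviations after scaling
```

In this example peak1 and peak2 end up with identical z-scores even though their raw values differ tenfold, which is exactly why row z-scores highlight patterns rather than magnitudes. A row with zero variance would produce `NaN` values and should be filtered out beforehand.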

# Start to plot z-score heatmap
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score

# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)


ht_list <- 
  Heatmap(plot.atac.row.z.score,
          name = "Row z-score (ATAC-seq log2(counts))", 
          col = col.zscore,
          column_names_gp = gpar(fontsize = 8),
          show_column_names = F,
          heatmap_legend_param = list(legend_direction = "horizontal",
                                      labels_gp = gpar(fontsize = 12), 
                                      title_gp = gpar(fontsize = 12)),
          show_row_names = FALSE,
          cluster_columns = TRUE,
          use_raster = TRUE,
          right_annotation = rows.annot,
          raster_device = c("png"),
          raster_quality = 2,
          cluster_rows = T,
          row_title = paste0(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off),
                             " ATAC-seq peaks"),
          #column_order = cols.order,
          row_names_gp = gpar(fontsize = 4),
          top_annotation = ha,
          #width = unit(15, "cm"),
          #column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"), 
          column_title_gp = gpar(fontsize = 12), 
          row_title_gp = gpar(fontsize = 12)) 

options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE, 
     column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", 
                           fdr.cut.off,",  Diff mean log2 Count > ",
                           diff.cut.off,")"),
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "bottom",
     annotation_legend_side = "right")

Merging ATAC-Seq replicates¶

If, instead of plotting all replicates, you want to plot a single value for each patient, you can take the mean of the replicate values.
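As a minimal sketch of the averaging idea, assuming a toy matrix with invented values and a "_rep1"/"_rep2" naming scheme, the replicate columns of each patient can be collapsed with `rowMeans()`:

```r
# Toy matrix: 2 peaks x 4 columns (two patients, two replicates each)
mat <- matrix(c(1, 3, 10, 20,
                2, 4, 30, 50),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("peak1", "peak2"),
                              c("P1_rep1", "P1_rep2", "P2_rep1", "P2_rep2")))

# Patient label of each column: drop the replicate suffix
patient <- sub("_rep[0-9]+$", "", colnames(mat))

# Average the columns that belong to the same patient
merged <- sapply(unique(patient), function(p) {
  rowMeans(mat[, patient == p, drop = FALSE])
})
print(merged)
```

The `drop = FALSE` argument keeps the selection a matrix even when a patient has a single replicate, so `rowMeans()` does not fail in that case.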

# This function will calculate the Means of the peaks for a given group
# in our case we will calculate the mean of the replicates of each patient.
groupMeans <- function(mat, groups = NULL, na.rm = TRUE){
  stopifnot(!is.null(groups))
  gm <- lapply(unique(groups), function(x){
    rowMeans(mat[,which(groups == x),drop = F], na.rm=na.rm)
  }) %>% Reduce("cbind",.)
  colnames(gm) <- unique(groups)
  return(gm)
}
matMerged <- groupMeans(mat = assays(esca.rse)$log2norm, groups = colData(esca.rse)$sample)

# keep only metadata for replicate 1
metadata <- colData(esca.rse)[grep("rep1",rownames(colData(esca.rse))),]


# Create the upper annotation track for the samples
ha = HeatmapAnnotation(df = data.frame("Group" = metadata$primary_diagnosis),
                       show_annotation_name = TRUE,
                       col = list(Group = c("Squamous cell carcinoma, NOS" =  "red", 
                                            "Adenocarcinoma, NOS" = "blue")),
                       show_legend = TRUE,
                       annotation_name_side = "left",
                       annotation_name_gp = gpar(fontsize = 6))

# Select the significant peaks to be plotted
plot.atac <- matMerged[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]

# Define the color scheme based on the values
col <- colorRamp2(seq(min(plot.atac), max(plot.atac), 
                      by = (max(plot.atac) - min(plot.atac))/99), pal_atac)


# Plot ATAC-Seq signal values
ht_list <- 
  Heatmap(plot.atac,
          name = "ATAC-Seq log2(counts)", 
          col = col,
          column_names_gp = gpar(fontsize = 8),
          show_column_names = F,
          heatmap_legend_param = list(legend_direction = "horizontal",
                                      labels_gp = gpar(fontsize = 12), 
                                      title_gp = gpar(fontsize = 12)),
          show_row_names = FALSE,
          cluster_columns = TRUE,
          use_raster = TRUE,
          raster_device = c("png"),
          raster_quality = 2,
          cluster_rows = T,
          right_annotation = rows.annot,
          row_title = paste0(sum(result$FDR < fdr.cut.off & 
                                   abs(result$ESCC_minus_ESAD) > diff.cut.off),
                             " ATAC-seq peaks"),
          row_names_gp = gpar(fontsize = 4),
          top_annotation = ha,
          column_title_gp = gpar(fontsize = 12), 
          row_title_gp = gpar(fontsize = 12)) 

options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE, 
     column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
                           ",  Diff mean log2 Count > ",diff.cut.off,")"),
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "bottom",
     annotation_legend_side = "right")


# Start to plot z-score heatmap
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score

# Recreate color scheme based on the z-score levels we will truncate it from -2 to 2.
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)

ht_list <- 
  Heatmap(plot.atac.row.z.score,
          name = "Row z-score (ATAC-seq log2(counts))", 
          col = col.zscore,
          column_names_gp = gpar(fontsize = 8),
          show_column_names = F,
          heatmap_legend_param = list(legend_direction = "horizontal",
                                      labels_gp = gpar(fontsize = 12), 
                                      title_gp = gpar(fontsize = 12)),
          show_row_names = FALSE,
          cluster_columns = TRUE,
          use_raster = TRUE,
          right_annotation = rows.annot,
          raster_device = c("png"),
          raster_quality = 2,
          cluster_rows = T,
          row_title = paste0(sum(result$FDR < fdr.cut.off & 
                                   abs(result$ESCC_minus_ESAD) > diff.cut.off),
                             " ATAC-seq peaks"),
          row_names_gp = gpar(fontsize = 4),
          top_annotation = ha,
          column_title_gp = gpar(fontsize = 12), 
          row_title_gp = gpar(fontsize = 12)) 

options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE, 
     column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
                           ",  Diff mean log2 Count > ",diff.cut.off,")"),
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "bottom",
     annotation_legend_side = "right")
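
The row z-score step can be checked on a small matrix before applying it to the full peak matrix. The sketch below uses made-up values (`toy` is not part of the workshop data) and base R only. It shows that each row ends up with mean 0 and standard deviation 1, and how values could be clipped to the -2 to 2 range covered by the color scale; the explicit clipping is shown only for illustration.

```r
# Toy matrix: 2 peaks (rows) x 4 samples (columns); the values are made up
toy <- matrix(c(1, 2, 3, 4,
                10, 10, 20, 50),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("peak1", "peak2"), paste0("s", 1:4)))

# scale() standardizes columns, so transpose, scale, and transpose back
toy.z <- t(scale(t(toy)))

# Every row now has mean 0 and standard deviation 1
round(rowMeans(toy.z), 10)
apply(toy.z, 1, sd)

# Clip to the range covered by the color scale (-2 to 2)
toy.clipped <- pmax(pmin(toy.z, 2), -2)
range(toy.clipped)
```

Because the scaling is done per row, the heatmap colors show how each sample deviates from the peak's own mean, not the absolute signal.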

ATAC-Seq Bigwig¶

The ATAC-Seq bigwig files are available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.

Here is some information about the bigwig files.

All bigWig files for each cancer type are compressed using tar and gzip. As such, each of the .tgz files contains all of the individual bigWig files for each technical replicate.

We recommend extracting the files with the following command: tar -zxvf file_name.tgz --strip-components 8, where "--strip-components 8" extracts the files without recreating their original directory structure.

The provided bigWig files have been normalized by the total insertions in peaks and then binned into 100-bp bins. Each 100-bp bin represents the normalized number of insertions that occurred within the corresponding 100 bp.
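
To make the binning concrete, the sketch below counts insertion positions in 100-bp bins and scales the counts by the total number of insertions. It is only an illustration with made-up positions: `insertions`, `bin.size`, and the `scale.factor` of 1e6 are assumptions, and the total here is taken over all toy insertions, not over insertions in called peaks, so this is not the exact procedure used to build the GDC files.

```r
# Made-up Tn5 insertion positions (1-based) on one chromosome
insertions <- c(5, 40, 99, 100, 101, 150, 250, 251, 299)
bin.size <- 100

# Bin index: positions 1-100 -> bin 1, 101-200 -> bin 2, ...
bin <- (insertions - 1) %/% bin.size + 1

# Raw number of insertions per bin
counts <- tabulate(bin, nbins = max(bin))

# Scale by the total number of insertions (hypothetical scale factor)
scale.factor <- 1e6
normalized <- counts / length(insertions) * scale.factor

data.frame(start = (seq_along(counts) - 1) * bin.size + 1,
           end = seq_along(counts) * bin.size,
           counts = counts,
           normalized = normalized)
```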

The bigwig names also use Stanford UUIDs. The script below helps to rename the bigwig files with TCGA barcodes. First, we get the paths to the downloaded bigwig files after uncompressing them and read the information file from the ATAC-Seq website.

bigwig.files <- dir(path = "ATAC-seq_data/ESCA_bigWigs/",
                    pattern = "\\.bw$",
                    all.files = TRUE,
                    full.names = TRUE)
table <- readr::read_tsv("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c")

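plyr::a_ply(bigwig.files,1, function(file) {

  # Extract the Stanford UUID (written with "_" separators) from the file name
  file.uuid <- stringr::str_extract(file,
  "[:alnum:]{8}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{12}")

  # Find the matching row(s) in the information table, which uses "-" separators
  idx <- grep(file.uuid,gsub("-","_",table$stanfordUUID))

  barcode <- unique(table[idx,]$Case_ID)

  # For ESCA files, prefix the sample barcode with the histology (ESAD or ESCC)
  if(grepl("ESCA",file)){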
    samples.info <- TCGAbiolinks:::colDataPrepare(barcode)
    barcode <- gsub(",| |NOS","",
                        gsub("Adenocarcinoma","ESAD",
                          gsub("Squamous cell carcinoma","ESCC",
                            paste0(samples.info$primary_diagnosis,"-",
                                   samples.info$sample)
                          )
                        )
                    )
  }
  # change UUID to barcode
  to <- gsub(file.uuid,barcode,file)
  file.rename(file, to)
})
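
The UUID-to-barcode mapping can be tried on a single name before renaming any file. The sketch below is a dry run in base R: the file name is hypothetical (built from the `bam_prefix` pattern with a `.bw` suffix assumed), the UUID and barcode are taken from the sample information table, and `info` stands in for the table read from the GDC. No file is renamed.

```r
# Hypothetical file name; the UUID and barcode come from the information table
file <- "ESCA_1E6AC686_96C2_46C4_A451_12D30114DD06_X012_S02_L027_B1_T1_P024.bw"
info <- data.frame(
  stanfordUUID = "1E6AC686-96C2-46C4-A451-12D30114DD06",
  Case_ID      = "TCGA-IG-A51D-01A-31-A616-42",
  stringsAsFactors = FALSE
)

# Extract the underscore-separated UUID (8-4-4-4-12 alphanumeric characters)
uuid.pattern <- paste0("[[:alnum:]]{8}_[[:alnum:]]{4}_[[:alnum:]]{4}_",
                       "[[:alnum:]]{4}_[[:alnum:]]{12}")
file.uuid <- regmatches(file, regexpr(uuid.pattern, file))

# The table uses "-" where the file names use "_"
idx <- which(gsub("-", "_", info$stanfordUUID) == file.uuid)
barcode <- unique(info$Case_ID[idx])

# Name the file would receive (UUID replaced by the barcode)
to <- sub(file.uuid, barcode, file, fixed = TRUE)
to
```

Checking `to` for a few files first avoids renaming files to unexpected names when a UUID has no match in the table.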

Since loading several bigWig files can be slow in software such as IGV, users might want to reduce the bigWig files to a single chromosome (e.g. chr20). The R script below does this by converting each bigWig to a wig containing only chr20 and then converting the wig back to bigWig.

You can download the executables for bigWigToWig and wigToBigWig from the UCSC download server (http://hgdownload.cse.ucsc.edu/admin/exe/), and the hg38.chrom.sizes file is available on GitHub (https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes).

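chr <- 20
dirout <- paste0("chr", chr)
dir.create(dirout, showWarnings = FALSE)
# List the bigWig files in the current directory
files <- dir(path = ".", pattern = "\\.bw$", full.names = TRUE)
for (f in files) {
  f.in <- f
  # Intermediate wig file holding only the selected chromosome
  f.out <- sub("\\.bw$", ".wig", f)
  f.out.chr <- file.path(dirout, sub("\\.bw$", paste0("_chr", chr, ".bw"), basename(f)))
  # bigWig -> wig, keeping only the selected chromosome
  cmd <- paste0("bigWigToWig -chrom=chr", chr, " ", f.in, " ", f.out)
  system(cmd)
  # wig -> bigWig
  cmd <- paste0("wigToBigWig ", f.out, " hg38.chrom.sizes ", f.out.chr)
  system(cmd)
}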
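
Before launching the external tools, it can help to print the commands and check the derived file names. The sketch below builds the same two command lines for a single hypothetical file (`sample1.bw`) without running them. `build_cmds` is an illustrative helper, not part of any package, and `shQuote()` is added here to protect paths that contain spaces.

```r
# Hypothetical helper: build the two commands for one bigWig file
build_cmds <- function(f, chr = 20, dirout = paste0("chr", chr),
                       chrom.sizes = "hg38.chrom.sizes") {
  wig <- sub("\\.bw$", ".wig", f)
  out <- file.path(dirout,
                   sub("\\.bw$", paste0("_chr", chr, ".bw"), basename(f)))
  c(to.wig = paste("bigWigToWig", paste0("-chrom=chr", chr),
                   shQuote(f, type = "sh"), shQuote(wig, type = "sh")),
    to.bigwig = paste("wigToBigWig", shQuote(wig, type = "sh"),
                      chrom.sizes, shQuote(out, type = "sh")))
}

# Dry run: print the commands instead of executing them
cmds <- build_cmds("sample1.bw")
cat(cmds, sep = "\n")
```

Once the printed commands look right, each one can be passed to `system()`.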

Visualizing the bigwig files in R¶

We uploaded some samples (chromosome 20 only) to Google Drive. They can be plotted in R using the karyoploteR package, whose main documentation can be found at https://bernatgel.github.io/karyoploter_tutorial/.

# Load required libraries
suppressMessages({
    library(karyoploteR)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
})

# Plot parameters, adjusted only to make the plot look better
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0


# Get the transcript annotation to find the HNF4A region
tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "HNF4A"]

# The plot covers the HNF4A range +- 50 kb
HNF4A.region <- range(c(tssAnnot)) + 50000

# Start by plotting gene tracks
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex = 0.5, plot.params = pp)
genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
                                    karyoplot = kp,
                                    plot.transcripts = TRUE, 
                                    plot.transcripts.structure = TRUE)
genes.data <- addGeneNames(genes.data)
genes.data <- mergeTranscripts(genes.data)


kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex = 0.5, plot.params = pp)
kpAddBaseNumbers(kp, tick.dist = 20000, minor.tick.dist = 5000,
                 add.units = TRUE, cex = 0.4, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0, r1 = 0.25, gene.name.cex = 0.5)


# Start to plot bigwig files
big.wig.files <- dir(path = "Data/ESCA_bigwig_chr20/",
                    pattern = "\\.bw$",
                    all.files = TRUE,
                    full.names = TRUE)
big.wig.files

# Reserve area to plot the bigwig files
out.at <- autotrack(1:length(big.wig.files), 
                    length(big.wig.files), 
                    margin = 0.3, 
                    r0 = 0.3,
                    r1 = 1)

# Add ATAC-seq label from 0.3 to 1 which should cover all ATAC-seq tracks
kpAddLabels(kp, 
            labels = "ATAC-Seq", 
            r0 = out.at$r0, 
            r1 = out.at$r1, 
            side = "left",
            cex = 1,
            srt = 90, 
            pos = 3, 
            label.margin = 0.1)


for(i in seq_len(length(big.wig.files))) {
  bigwig.file <- big.wig.files[i]
  
  # Define where the track will be plotted.
  # autotrack simply takes the reserved space (from out.at$r0 up to out.at$r1)
  # and splits it into equal parts, one per bigwig file; the index i controls
  # which one is being plotted
  at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
  
  # Plot bigwig
  kp <- kpPlotBigWig(kp, 
                     data = bigwig.file, 
                     ymax = "visible.region", 
                     r0 = at$r0, 
                     col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
                     r1 = at$r1)
  computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
  
  # Add track axis
  kpAxis(kp, 
         ymin = 0, 
         ymax = computed.ymax, 
         numticks = 2,
         r0 = at$r0, 
         r1 = at$r1,
         cex = 0.5)
  
  # Add track label
  kpAddLabels(kp, 
              labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
              r0 = at$r0, 
              r1 = at$r1, 
              cex = 0.5, 
              label.margin = 0.01)
}

Loading required package: org.Hs.eg.db
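
The track layout used in the loop can be reasoned about with simple arithmetic. The sketch below is a hypothetical re-implementation of the idea described in the comments: an equal split of the reserved vertical space, with a fraction of each slot left empty as a margin. `split_track` is an illustrative name, and the way the margin is applied is an assumption, so the numbers may not match karyoploteR's `autotrack()` exactly.

```r
# Hypothetical helper: vertical range of track i out of n within [r0, r1].
# 'margin' is the fraction of each slot left empty at the top (an assumption).
split_track <- function(i, n, r0 = 0, r1 = 1, margin = 0.05) {
  slot <- (r1 - r0) / n
  list(r0 = r0 + (i - 1) * slot,
       r1 = r0 + i * slot - slot * margin)
}

# Three hypothetical tracks sharing the space between 0.3 and 1
tracks <- lapply(1:3, split_track, n = 3, r0 = 0.3, r1 = 1, margin = 0.2)
do.call(rbind, lapply(tracks, as.data.frame))
```

Each track gets its own non-overlapping vertical band, which is why the loop only needs the index `i` to place a bigwig file.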

Visualizing links¶

One of the links identified using HM450 was cg03326606-PARD6B, which is plotted below.

# Load libraries for this section
suppressMessages({
    library(karyoploteR)
    library(TxDb.Hsapiens.UCSC.hg38.knownGene)
})

# karyoploteR options for better plotting
pp <- getDefaultPlotParams(plot.type = 1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0


# Get probe genomic ranges to plot
cg03326606 <- ELMER:::getInfiniumAnnotation()["cg03326606"]
cg03326606

tssAnnot <- ELMER::getTSS(genome = "hg38")
tssAnnot <- tssAnnot[tssAnnot$external_gene_name == "PARD6B"]


suppressWarnings({
pair.region <- range(c(tssAnnot,cg03326606)) + 200

kp <- plotKaryotype(zoom = pair.region,genome = "hg38", cex = 0.5, plot.params = pp)


genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
                                    karyoplot = kp,
                                    plot.transcripts = TRUE, 
                                    plot.transcripts.structure = TRUE)
genes.data <- addGeneNames(genes.data)


genes.data <- mergeTranscripts(genes.data)

promoter <- promoters(range(genes.data$transcripts$`84612`),downstream = 0)

kp <- plotKaryotype(zoom = pair.region,genome = "hg38", cex = 0.5, plot.params = pp)
kpPlotRegions(kp, toGRanges("chr20:50729030-50729031"), r0=0, r1=0.02, col="#ff8d92")
kpPlotRegions(kp, promoter, r0 = 0, r1 = 0.02, col="#8d9aff")

kpPlotLinks(kp, 
            data = toGRanges("chr20:50729030-50729031"), 
            data2 = promoter, 
            col = "#fac7ffaa", 
            r0 = 0.02,
            arch.height = 0.1)
kpAddBaseNumbers(kp, tick.dist = 10000, minor.tick.dist = 2000,
                 add.units = TRUE, cex = 0.5, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0.12, r1 = 0.25, gene.name.cex = 0.5)
})

big.wig.files <- dir(path = "Data/ESCA_bigwig_chr20/",
                    pattern = "\\.bw$",
                    all.files = TRUE,
                    full.names = TRUE)

out.at <- autotrack(1:length(big.wig.files), 
                    length(big.wig.files), 
                    margin = 0.3, 
                    r0 = 0.3,
                    r1 = 1)

kpAddLabels(kp, 
            labels = "ATAC-seq", 
            r0 = out.at$r0, 
            r1 = out.at$r1, 
            side = "left",
            cex = 1,
            srt = 90, 
            pos = 3, 
            label.margin = 0.1)

for(i in seq_len(length(big.wig.files))) {
  bigwig.file <- big.wig.files[i]
  at <- autotrack(i, length(big.wig.files), r0 = out.at$r0, r1 = out.at$r1, margin = 0.2)
  kp <- kpPlotBigWig(kp, 
                     data = bigwig.file, 
                     ymax = 523,
                     r0 = at$r0, 
                     col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
                     r1 = at$r1)
  computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
  kpAxis(kp, 
         ymin = 0, 
         ymax = computed.ymax, 
         numticks = 2,
         r0 = at$r0, 
         r1 = at$r1,
         cex = 0.5)
  kpAddLabels(kp, 
              labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
              r0 = at$r0, 
              r1 = at$r1, 
              cex = 1, 
              label.margin = 0.01)
}

GRanges object with 1 range and 52 metadata columns:
             seqnames            ranges strand | address_A address_B
                <Rle>         <IRanges>  <Rle> | <integer> <integer>
  cg03326606    chr20 50729030-50729031      + |  35670461      <NA>
                 channel  designType    nextBase nextBaseRef   probeType
             <character> <character> <character> <character> <character>
  cg03326606        Both          II         G/A           C          cg
             orientation probeCpGcnt context35  probeBeg  probeEnd
             <character>   <integer> <integer> <integer> <numeric>
  cg03326606          up           1         2  50728981  50729030
                                                     ProbeSeq_A  ProbeSeq_B
                                                    <character> <character>
  cg03326606 TTTATCAAATTCCAACTATTTTCTACTTTTCTTCCTTATACAACCCAAAC            
                    gene   gene_HGNC      chrm_A     beg_A    flag_A    mapQ_A
             <character> <character> <character> <integer> <integer> <integer>
  cg03326606        <NA>        <NA>       chr20  50728981         0        60
                 cigar_A      NM_A      chrm_B     beg_B    flag_B    mapQ_B
             <character> <integer> <character> <integer> <integer> <integer>
  cg03326606         50M         0        <NA>      <NA>      <NA>      <NA>
                 cigar_B      NM_B wDecoy_chrm_A wDecoy_beg_A wDecoy_flag_A
             <character> <integer>   <character>    <integer>     <integer>
  cg03326606        <NA>      <NA>         chr20     50728981             0
             wDecoy_mapQ_A wDecoy_cigar_A wDecoy_NM_A wDecoy_chrm_B
                 <integer>    <character>   <integer>   <character>
  cg03326606            60            50M           0          <NA>
             wDecoy_beg_B wDecoy_flag_B wDecoy_mapQ_B wDecoy_cigar_B
                <integer>     <integer>     <integer>    <character>
  cg03326606         <NA>          <NA>          <NA>           <NA>
             wDecoy_NM_B  posMatch MASK_mapping MASK_typeINextBaseSwitch
               <integer> <logical>    <logical>                <logical>
  cg03326606        <NA>      <NA>        FALSE                    FALSE
             MASK_rmsk15 MASK_sub40_copy MASK_sub35_copy MASK_sub30_copy
               <logical>       <logical>       <logical>       <logical>
  cg03326606       FALSE           FALSE           FALSE           FALSE
             MASK_sub25_copy MASK_snp5_common MASK_snp5_GMAF1p MASK_extBase
                   <logical>        <logical>        <logical>    <logical>
  cg03326606           FALSE            FALSE            FALSE        FALSE
             MASK_general
                <logical>
  cg03326606        FALSE
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

Session information¶

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.7.0                     
 [2] doParallel_1.0.15                      
 [3] iterators_1.0.12                       
 [4] foreach_1.4.7                          
 [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
 [6] GenomicFeatures_1.34.8                 
 [7] AnnotationDbi_1.44.0                   
 [8] karyoploteR_1.11.5                     
 [9] regioneR_1.14.0                        
[10] circlize_0.4.8                         
[11] ComplexHeatmap_2.1.0                   
[12] TCGAbiolinks_2.13.6                    
[13] plyr_1.8.4                             
[14] SummarizedExperiment_1.12.0            
[15] DelayedArray_0.8.0                     
[16] BiocParallel_1.16.6                    
[17] matrixStats_0.55.0                     
[18] Biobase_2.42.0                         
[19] dplyr_0.8.3                            
[20] tidyr_1.0.0                            
[21] GenomicRanges_1.34.0                   
[22] GenomeInfoDb_1.18.2                    
[23] IRanges_2.16.0                         
[24] S4Vectors_0.20.1                       
[25] BiocGenerics_0.28.0                    
[26] readr_1.3.1                            

loaded via a namespace (and not attached):
  [1] R.utils_2.9.0               tidyselect_0.2.5           
  [3] RSQLite_2.1.2               htmlwidgets_1.3            
  [5] ELMER_2.9.4                 DESeq_1.34.1               
  [7] munsell_0.5.0               codetools_0.2-16           
  [9] pbdZMQ_0.3-3                colorspace_1.4-1           
 [11] knitr_1.24                  uuid_0.1-2                 
 [13] rstudioapi_0.10             ggsignif_0.6.0             
 [15] labeling_0.3                repr_1.0.1                 
 [17] GenomeInfoDbData_1.2.1      hwriter_1.3.2              
 [19] KMsurv_0.1-5                bit64_0.9-7                
 [21] downloader_0.4              vctrs_0.2.0                
 [23] generics_0.0.2              xfun_0.9                   
 [25] biovizBase_1.30.1           ggthemes_4.2.0             
 [27] EDASeq_2.16.3               ELMER.data_2.9.3           
 [29] R6_2.4.0                    clue_0.3-57                
 [31] locfit_1.5-9.1              AnnotationFilter_1.6.0     
 [33] bitops_1.0-6                reshape_0.8.8              
 [35] assertthat_0.2.1            scales_1.0.0               
 [37] nnet_7.3-12                 gtable_0.3.0               
 [39] sva_3.30.1                  ensembldb_2.6.3            
 [41] rlang_0.4.0                 zeallot_0.1.0              
 [43] genefilter_1.64.0           GlobalOptions_0.1.0        
 [45] splines_3.5.1               rtracklayer_1.42.2         
 [47] lazyeval_0.2.2              acepack_1.4.1              
 [49] dichromat_2.0-0             selectr_0.4-1              
 [51] broom_0.5.2                 checkmate_1.9.4            
 [53] backports_1.1.4             Hmisc_4.2-0                
 [55] tools_3.5.1                 ggplot2_3.2.1              
 [57] RColorBrewer_1.1-2          MultiAssayExperiment_1.8.1 
 [59] Rcpp_1.0.2                  base64enc_0.1-3            
 [61] progress_1.2.2              zlibbioc_1.28.0            
 [63] purrr_0.3.2                 RCurl_1.95-4.12            
 [65] prettyunits_1.0.2           ggpubr_0.2.3               
 [67] rpart_4.1-15                GetoptLong_0.1.7           
 [69] zoo_1.8-6                   ggrepel_0.8.1              
 [71] cluster_2.1.0               magrittr_1.5               
 [73] data.table_1.12.2           survminer_0.4.6            
 [75] ProtGenerics_1.14.0         aroma.light_3.12.0         
 [77] hms_0.5.1                   evaluate_0.14              
 [79] xtable_1.8-4                XML_3.98-1.19              
 [81] gridExtra_2.3               shape_1.4.4                
 [83] compiler_3.5.1              biomaRt_2.38.0             
 [85] tibble_2.1.3                crayon_1.3.4               
 [87] R.oo_1.22.0                 htmltools_0.3.6            
 [89] mgcv_1.8-28                 Formula_1.2-3              
 [91] geneplotter_1.60.0          DBI_1.0.0                  
 [93] matlab_1.0.2                ShortRead_1.40.0           
 [95] Matrix_1.2-17               R.methodsS3_1.7.1          
 [97] Gviz_1.26.4                 pkgconfig_2.0.2            
 [99] km.ci_0.5-2                 GenomicAlignments_1.18.1   
[101] foreign_0.8-72              IRdisplay_0.7.0            
[103] plotly_4.9.0                xml2_1.2.2                 
[105] annotate_1.60.1             XVector_0.22.0             
[107] rvest_0.3.4                 stringr_1.4.0              
[109] bezier_1.1.2                VariantAnnotation_1.28.3   
[111] digest_0.6.20               ConsensusClusterPlus_1.46.0
[113] Biostrings_2.50.2           rmarkdown_1.15             
[115] survMisc_0.5.5              htmlTable_1.13.1           
[117] edgeR_3.24.3                curl_4.1                   
[119] Rsamtools_1.34.1            rjson_0.2.20               
[121] lifecycle_0.1.0             nlme_3.1-141               
[123] jsonlite_1.6                viridisLite_0.3.0          
[125] limma_3.38.3                BSgenome_1.50.0            
[127] pillar_1.4.2                lattice_0.20-38            
[129] httr_1.4.1                  survival_2.44-1.1          
[131] glue_1.3.1                  bamsignals_1.14.0          
[133] png_0.1-7                   bit_1.1-14                 
[135] stringi_1.4.3               blob_1.2.0                 
[137] latticeExtra_0.6-28         memoise_1.1.0              
[139] IRkernel_1.0.2

Workshop materials¶

Workshops HTMLs¶

ELMER data Workshop HTML: http://rpubs.com/tiagochst/elmer-data-workshop-2019
ELMER analysis Workshop HTML: http://rpubs.com/tiagochst/ELMER_workshop
ATAC-seq Workshop HTML: http://rpubs.com/tiagochst/atac_seq_workshop

Workshop videos¶

We have a set of recorded videos explaining some of the workshops.

All videos playlist: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt
ELMER algorithm: https://youtu.be/PzC31K9vfu0
ELMER data: https://youtu.be/R00wG--tGo8
ELMER analysis part 1: https://youtu.be/bcd4uyxrZCw
ELMER analysis part 2: https://youtu.be/vcJ_DSCt4Mo
ELMER summarizing several analyses: https://youtu.be/moLeik7JjLk
ATAC-Seq workshop: https://youtu.be/3ftZecz0lU4

seqnames	start	end	name	score	annotation	percentGC	percentAT
<chr>	<dbl>	<dbl>	<chr>	<dbl>	<chr>	<dbl>	<dbl>
chr1	1290095	1290596	ESCA_107	2.464378	3' UTR	0.6766467	0.3233533
chr1	1291115	1291616	ESCA_108	2.587929	3' UTR	0.7025948	0.2974052
chr1	1291753	1292254	ESCA_109	7.579962	3' UTR	0.6387226	0.3612774
chr1	1440824	1441325	ESCA_160	4.467274	3' UTR	0.6586826	0.3413174
chr1	1630188	1630689	ESCA_179	20.621324	3' UTR	0.7285429	0.2714571
chr1	2030218	2030719	ESCA_341	15.572581	3' UTR	0.6187625	0.3812375

seqnames	start	end	name	score	annotation	percentGC
<chr>	<dbl>	<dbl>	<chr>	<dbl>	<chr>	<dbl>
chr1	906012	906513	ACC_10	7.171193	Intron	0.6127745
chr2	112541661	112542162	ACC_10008	22.030579	Promoter	0.5548902
chr1	21673421	21673922	ACC_1001	6.459954	Distal	0.5089820
chr2	112584205	112584706	ACC_10013	43.208555	Promoter	0.5868263
chr2	112596243	112596744	ACC_10016	5.428209	Intron	0.4910180
chr1	21725692	21726193	ACC_1002	5.201273	Intron	0.4051896

x	freq
<lgl>	<int>
FALSE	113818
TRUE	13237

x	freq
<lgl>	<int>
TRUE	13237

bam_prefix	stanfordUUID	aliquot_id	Case_UUID	Case_ID
<chr>	<chr>	<chr>	<chr>	<chr>
BRCA-000CFD9F-ADDF-4304-9E60-6041549E189C-X017-S06-L011-B1-T1-P040	000CFD9F-ADDF-4304-9E60-6041549E189C	TCGA-A7-A13F-01A-31-A615-42	2cf68894-168b-458b-af4f-53cad72989a8	TCGA-A7-A13F-01A-31-A615-42
BRCA-000CFD9F-ADDF-4304-9E60-6041549E189C-X017-S06-L012-B1-T2-P046	000CFD9F-ADDF-4304-9E60-6041549E189C	TCGA-A7-A13F-01A-31-A615-42	2cf68894-168b-458b-af4f-53cad72989a8	TCGA-A7-A13F-01A-31-A615-42
PCPG-007124EC-1F9B-4FCB-BC6E-DB8C25FD9146-X033-S03-L098-B1-T1-P073	007124EC-1F9B-4FCB-BC6E-DB8C25FD9146	TCGA-RM-A68W-01A-31-A644-42	1a1cf490-8bd4-4a99-bf3a-34f06435de86	TCGA-RM-A68W-01A-31-A644-42
PCPG-007124EC-1F9B-4FCB-BC6E-DB8C25FD9146-X033-S03-L100-B1-T2-P077	007124EC-1F9B-4FCB-BC6E-DB8C25FD9146	TCGA-RM-A68W-01A-31-A644-42	1a1cf490-8bd4-4a99-bf3a-34f06435de86	TCGA-RM-A68W-01A-31-A644-42
STAD-00DFAA4D-DE64-4476-9546-18E728653046-X029-S06-L011-B1-T1-P072	00DFAA4D-DE64-4476-9546-18E728653046	TCGA-BR-A4J1-01A-31-A646-42	e9a98a44-83f2-490c-b053-1e953ebd4e7e	TCGA-BR-A4J1-01A-31-A646-42
STAD-00DFAA4D-DE64-4476-9546-18E728653046-X029-S06-L012-B1-T2-P077	00DFAA4D-DE64-4476-9546-18E728653046	TCGA-BR-A4J1-01A-31-A646-42	e9a98a44-83f2-490c-b053-1e953ebd4e7e	TCGA-BR-A4J1-01A-31-A646-42

	sample	primary_diagnosis
	<chr>	<chr>
TCGA-IG-A51D-01A-31-A616-42	TCGA-IG-A51D-01A	Squamous cell carcinoma, NOS
TCGA-LN-A9FQ-01A-11-A617-42	TCGA-LN-A9FQ-01A	Squamous cell carcinoma, NOS
TCGA-L7-A6VZ-01A-31-A616-42	TCGA-L7-A6VZ-01A	Adenocarcinoma, NOS
TCGA-L5-A4OE-01A-31-A616-42	TCGA-L5-A4OE-01A	Adenocarcinoma, NOS
TCGA-LN-A49O-01A-31-A616-42	TCGA-LN-A49O-01A	Squamous cell carcinoma, NOS
TCGA-IG-A7DP-01A-21-A616-42	TCGA-IG-A7DP-01A	Adenocarcinoma, NOS

A tibble: 4 × 8
seqnames	start	end	name	score	ESCA_1E6AC686_96C2_46C4_A451_12D30114DD06_X012_S02_L027_B1_T1_P024	ESCA_1E6AC686_96C2_46C4_A451_12D30114DD06_X012_S02_L028_B1_T2_P026	ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016
<chr>	<dbl>	<dbl>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
chr1	10180	10679	ESCA_1	2.084220	1.440835	1.220656	1.9305615
chr1	180641	181140	ESCA_2	2.166890	2.545305	2.229090	2.2083821
chr1	181193	181692	ESCA_3	6.633801	2.096789	1.978599	2.0173702
chr1	184246	184745	ESCA_4	2.587929	1.440835	1.158167	-0.2450712

A tibble: 6 × 33
ESCC-TCGA-IG-A51D-01A	ESCC-TCGA-IG-A51D-01A	ESCC-TCGA-LN-A9FQ-01A	ESCC-TCGA-LN-A9FQ-01A	ESAD-TCGA-L7-A6VZ-01A	ESAD-TCGA-L7-A6VZ-01A	ESAD-TCGA-L5-A4OE-01A	ESAD-TCGA-L5-A4OE-01A	ESCC-TCGA-LN-A49O-01A	ESCC-TCGA-LN-A49O-01A	⋯	ESCC-TCGA-LN-A4A2-01A	ESCC-TCGA-LN-A4MR-01A	ESCC-TCGA-IG-A625-01A	ESCC-TCGA-IG-A625-01A	ESCC-TCGA-LN-A49W-01A	ESCC-TCGA-LN-A49W-01A	ESAD-TCGA-IC-A6RE-01A	ESAD-TCGA-IC-A6RE-01A	ESAD-TCGA-M9-A5M8-01A	ESAD-TCGA-M9-A5M8-01A
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	⋯	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1.440835	1.220656	1.9305615	1.3856515	1.419756	1.2865314	1.317699	1.4211594	0.92281439	0.9143583	⋯	2.7565289	1.0451921	1.837044	2.1318391	1.7743853	1.702127	1.991984	2.134686	1.499836	1.7181801
2.545305	2.229090	2.2083821	2.6995197	1.096847	0.8068271	2.069828	1.8467054	2.51514795	1.3909241	⋯	1.6890171	1.0968471	2.282268	2.3466892	2.1695442	2.214447	1.918999	1.810474	0.936399	1.6480467
2.096789	1.978599	2.0173702	2.2231698	2.046083	1.7816299	2.962372	1.8467054	2.30934956	2.2857047	⋯	2.4644978	0.4274280	1.713514	2.3109811	2.7399164	2.915556	3.661431	3.738616	3.738182	4.0028779
1.440835	1.158167	-0.2450712	0.8370665	2.669934	2.7365554	1.096847	0.9080052	0.02756702	-0.1840065	⋯	0.9603146	-0.8232581	1.117082	0.5849464	0.5613877	1.047300	1.834993	1.571171	1.844719	0.4924719
1.337133	1.540553	3.1314019	3.1264294	1.994056	1.6242283	1.317699	0.7580145	1.78351492	1.1104350	⋯	1.6218309	1.7588014	2.252158	2.1729770	1.7021270	1.419756	2.233044	2.112490	1.041306	0.9972153
5.476800	5.351221	4.0070949	4.0423047	5.029988	4.9863520	5.068612	4.9290961	4.43081814	4.7192657	⋯	5.7945555	4.5463322	4.793764	4.7346741	5.1384999	5.022852	5.440720	5.438686	5.296503	5.0908551

A data.frame: 6 × 149
sample	patient	barcode	shortLetterCode	definition	year_of_diagnosis	classification_of_tumor	last_known_disease_status	updated_datetime.x	primary_diagnosis	⋯	subtype_GEA.CIN.Integrated.Cluster...COCA	subtype_GEA.CIN.Integrated.Cluster...iCluster	subtype_GEA.CIN.Integrated.Cluster...SuperCluster	subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.4	subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.7	bam_prefix	stanfordUUID	aliquot_id	Case_UUID	Case_ID
<chr>	<chr>	<chr>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>	<chr>	⋯	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
TCGA-IG-A51D-01A	TCGA-IG-A51D	TCGA-IG-A51D-01A-31-A616-42	TP	Primary solid Tumor	2012	not reported	not reported	2019-08-08T16:38:14.344513-05:00	Squamous cell carcinoma, NOS	⋯	NA	NA	NA	NA	NA	ESCA-1E6AC686-96C2-46C4-A451-12D30114DD06-X012-S02-L027-B1-T1-P024	1E6AC686-96C2-46C4-A451-12D30114DD06	TCGA-IG-A51D-01A-31-A616-42	f8dbab24-b9f4-4b8a-bfea-57856ccf6364	TCGA-IG-A51D-01A-31-A616-42
TCGA-IG-A51D-01A	TCGA-IG-A51D	TCGA-IG-A51D-01A-31-A616-42	TP	Primary solid Tumor	2012	not reported	not reported	2019-08-08T16:38:14.344513-05:00	Squamous cell carcinoma, NOS	⋯	NA	NA	NA	NA	NA	ESCA-1E6AC686-96C2-46C4-A451-12D30114DD06-X012-S02-L028-B1-T2-P026	1E6AC686-96C2-46C4-A451-12D30114DD06	TCGA-IG-A51D-01A-31-A616-42	f8dbab24-b9f4-4b8a-bfea-57856ccf6364	TCGA-IG-A51D-01A-31-A616-42
TCGA-LN-A9FQ-01A	TCGA-LN-A9FQ	TCGA-LN-A9FQ-01A-11-A617-42	TP	Primary solid Tumor	2013	not reported	not reported	2019-08-08T16:39:06.938006-05:00	Squamous cell carcinoma, NOS	⋯	NA	NA	NA	NA	NA	ESCA-1FEF5E19-2351-4D46-99B4-63F0637ABF17-X010-S08-L063-B1-T1-P016	1FEF5E19-2351-4D46-99B4-63F0637ABF17	TCGA-LN-A9FQ-01A-11-A617-42	37d13493-975c-432c-bd21-65f383fc66c9	TCGA-LN-A9FQ-01A-11-A617-42
TCGA-LN-A9FQ-01A	TCGA-LN-A9FQ	TCGA-LN-A9FQ-01A-11-A617-42	TP	Primary solid Tumor	2013	not reported	not reported	2019-08-08T16:39:06.938006-05:00	Squamous cell carcinoma, NOS	⋯	NA	NA	NA	NA	NA	ESCA-1FEF5E19-2351-4D46-99B4-63F0637ABF17-X010-S08-L064-B1-T2-P016	1FEF5E19-2351-4D46-99B4-63F0637ABF17	TCGA-LN-A9FQ-01A-11-A617-42	37d13493-975c-432c-bd21-65f383fc66c9	TCGA-LN-A9FQ-01A-11-A617-42
TCGA-L7-A6VZ-01A	TCGA-L7-A6VZ	TCGA-L7-A6VZ-01A-31-A616-42	TP	Primary solid Tumor	2013	not reported	not reported	2019-08-08T16:38:43.222690-05:00	Adenocarcinoma, NOS	⋯	C1	C1	C2	C4	C7	ESCA-22102BD7-4D9F-4914-84D0-C64E2CB0D8C1-X018-S02-L027-B1-T1-P040	22102BD7-4D9F-4914-84D0-C64E2CB0D8C1	TCGA-L7-A6VZ-01A-31-A616-42	dc4062d7-1c81-4b19-aabd-8307a0df5029	TCGA-L7-A6VZ-01A-31-A616-42
TCGA-L7-A6VZ-01A	TCGA-L7-A6VZ	TCGA-L7-A6VZ-01A-31-A616-42	TP	Primary solid Tumor	2013	not reported	not reported	2019-08-08T16:38:43.222690-05:00	Adenocarcinoma, NOS	⋯	C1	C1	C2	C4	C7	ESCA-22102BD7-4D9F-4914-84D0-C64E2CB0D8C1-X018-S02-L028-B1-T2-P045	22102BD7-4D9F-4914-84D0-C64E2CB0D8C1	TCGA-L7-A6VZ-01A-31-A616-42	dc4062d7-1c81-4b19-aabd-8307a0df5029	TCGA-L7-A6VZ-01A-31-A616-42