Tuesday, May 15, 2018

JSON file I/O and GDC metadata download

What is JSON?

JSON is short for JavaScript Object Notation. It is a lightweight alternative to XML: it records sample information in a shorter, simpler format, but the structure is hierarchical, similar to XML.
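For example, a record that might be stored as nested XML tags looks like this in JSON (a generic illustration, not actual GDC metadata):

{
  "file_name": "example.bam",
  "cases": [
    {"project": "TCGA-LGG", "sample_type": "Primary Tumor"}
  ]
}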

Why JSON?

The reason I looked into this is that the metadata downloaded from the GDC website is in JSON format, and I was trying to compare the samples downloaded from different data categories.

How to download data from GDC?

1. We can download directly by selecting the samples, adding them to the cart, and downloading to the local computer. However, this is a bit slow, and if your destination is not the local computer, you will spend extra time transferring the files.
2. Use gdc-client to download the data set using the manifest file (gdc-client download). In this case, as long as the manifest file is provided, we can easily download from the terminal directly to the server (see the example command below).
*For Red Hat server users, special instructions are needed to install it.
*For controlled data, an authorization token is needed for gdc-client.
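For example (assuming gdc-client is installed on the server; the token file is only required for controlled-access data, and both file names here are placeholders for your own):

gdc-client download -m gdc_manifest.txt -t gdc-user-token.txt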

Here is the format of the manifest file:

id filename md5 size state
a8f1ba67-bcef-45d1-b508-d4ce580d4362 TCGA-GV-A3JV-10B-01D-A221_120830_SN590_0178_AD143RACXX_s_2_rg.sorted.bam f6ee0cf24f786edb34743e1a68e0b4de 19824636256 live
c85e79ff-7813-4679-9c88-de2565ff4afa G32450.TCGA-FD-A3N5-10A-01D-A21A-08.4.bam afd2efa5c9634bd02cbf9bb4a68c5314 160495436671 live

We can see that, for each file, five fields are used to identify and download it. But how can we map the samples downloaded from different data categories?

How to match samples from different data categories?

1. select samples in the GDC data portal
2. add samples into the cart
3. go to the cart and download the metadata
4. extract the useful metadata, such as TCGA sample ID, tumor project, tumor type, filename, id (which can be used to match to the manifest file), etc.

Here I wrote a small Python script (read_json.py) that can be used to extract this information.

How to read a JSON file using Python?

A JSON file can be read into Python using: import json.
json.load and json.loads differ: the latter parses a JSON string, while the former reads from a file object.
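A minimal illustration of the difference:

import json

# json.loads parses a JSON-formatted string already in memory
meta = json.loads('{"file_id": "abc123", "md5sum": "f6ee0c"}')
print(meta["file_id"])  # -> abc123

# json.load reads and parses an open file object
with open("metadata.cart.json") as fh:  # hypothetical file name
    records = json.load(fh)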

You can download the read_json.py file above and use it to extract the needed information from the metadata file. Example:
python read_json.py --meta_file path/to/file.json  --output_file  path/to/output.txt --log_file path/to/log.txt

With this code, the following fields will be extracted:

file_id;file_name;md5sum;entity_submitter_id;disease_type;primary_site;project;sample_type

file_id: can be used to match the id in the manifest file.
file_name: can be used to match the filename in the manifest file.
md5sum: can be used to match the md5 in the manifest file.
entity_submitter_id: the TCGA sample barcode, such as TCGA-HT-7602-01A-21D-2087-05
disease_type: cancer type
primary_site: cancer site
project: such as TCGA-LGG
sample_type: solid primary tumor or normal, etc.
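If you prefer to roll your own extractor instead of using read_json.py, the sketch below shows the general idea. The exact nesting of the GDC cart metadata JSON can differ between data releases, so the key paths used here are assumptions to verify against your own file:

import json

# assumed layout: a list of file records, each carrying "associated_entities"
# and "cases" sub-records; verify these key paths against your own metadata
with open("metadata.cart.json") as fh:  # hypothetical file name
    records = json.load(fh)

for rec in records:
    case = rec["cases"][0]
    print("\t".join([
        rec["file_id"],                                        # matches manifest "id"
        rec["file_name"],                                      # matches manifest "filename"
        rec["md5sum"],                                         # matches manifest "md5"
        rec["associated_entities"][0]["entity_submitter_id"],  # TCGA barcode
        str(case.get("disease_type", "NA")),
        str(case.get("primary_site", "NA")),
        case["project"]["project_id"],                         # e.g., TCGA-LGG
        case["samples"][0]["sample_type"],                     # e.g., Primary Tumor
    ]))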




Monday, April 30, 2018

DetectCNV: VarScan2 and CNVkit (WXS)

Foreword:

We can use either WGS or WXS to detect CNV with many different algorithms. Generally speaking, WGS is more reliable, since WXS data are affected by the discontinuous covered regions and the enrichment bias of the capture kit (which captures the exome regions).


Some review papers have summarized the advantages and disadvantages of the algorithms/packages:
(1) Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives, Zhao et al. (2013) (for both WGS and WXS)
(2) An evaluation of copy number variation detection tools for cancer using whole exome sequencing data, Zare et al. (2017) (specific to WXS)

Main Content: 

I am working on some exome-seq data now, so I will focus on packages specific to WXS.


##ADTEx (source)


I did try this method, but I hit the same problem as posted online: the coverage file just keeps growing until the job is stopped for taking too much space (~1.3T).
The problem has been posted here: ADTEx, coverage issue. If you find the answer, please post it.


##VarScan2



We have five steps. However, the second step did not work for me (nothing was generated with the command provided). Also, step 3, which uses DNAcopy (Bioconductor), has incorrect code. The last step, which needs mergeSegments.pl, requires some code revisions as posted online: mergeSegments.pl revision (the last answer, very useful).

Here, I provided my revised analysis based on this post.



1. Run samtools mpileup and VarScan copynumber to obtain the raw copy-number output.



On terminal:

samtools mpileup -q 1 -f ref.fa normal.bam tumor.bam | java -jar \
VarScan.jar copynumber varScan --mpileup 1 --min-base-qual 20 \
--min-map-qual 20 --min-coverage 20 --min-segment-size 10 \
--max-segment-size 100 --p-value 0.01 --data-ratio $depth_ratio

* The $depth_ratio can be calculated by dividing the average depth of the normal sample by the average depth of the tumor sample. It adjusts for the sequencing-depth difference between tumor and normal (2018-May-15 update: be careful with this, because it may bias the final results. For example, if the normal consistently has higher read depth than the tumor, the tumor may come out as all deletion events.) A sketch for this calculation follows below.
* For some reason (I do not know why), the output file always shows up as output.copynumber, so I suggest creating a separate folder for each sample to avoid one sample's output overwriting another's.
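A minimal sketch for the $depth_ratio calculation (this assumes you have already run samtools depth -a on each BAM and saved the three-column chrom/pos/depth output; the file names are hypothetical):

# compute mean coverage from `samtools depth -a sample.bam > sample.depth.txt` output
def mean_depth(path):
    total, n = 0, 0
    with open(path) as fh:
        for line in fh:
            total += int(line.split("\t")[2])
            n += 1
    return total / n

normal = mean_depth("normal.depth.txt")  # hypothetical file names
tumor = mean_depth("tumor.depth.txt")
print("depth ratio (normal/tumor): %.4f" % (normal / tumor))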

The output.copynumber file looks like:


$head output.copynumber 
chrom chr_start chr_stop num_positions normal_depth tumor_depth log2_ratio gc_content
chr1 13179 13278 100 43.7 58.9 0.323 55.0
chr1 13279 13378 100 65.7 106.5 0.589 60.0
chr1 13379 13398 20 28.5 49.0 0.674 50.0
chr1 14670 14716 47 25.3 58.0 1.087 63.8
chr1 16734 16833 100 64.0 97.4 0.498 58.0
chr1 16834 16915 82 39.6 50.2 0.237 59.8
chr1 17410 17509 100 42.2 76.0 0.739 68.0


2. Use DNAcopy (Bioconductor) to process the raw copy-number calls. Code below:


R code:

run_seg <- function(dir){
  library(DNAcopy)
  P1 <- read.table(paste0(dir, "/output.copynumber"), header = TRUE)
  # build the CNA object from the log2 ratio (col 7), chromosome (col 1), and start position (col 2)
  CNV.object <- CNA(genomdat = P1[,7], chrom = P1[,1], maploc = P1[,2], data.type = "logratio")
  CNV.object.smoothed <- smooth.CNA(CNV.object)
  CNV.object.smoothed.seg <- segment(CNV.object.smoothed, verbose = 0, min.width = 2)
  seg.pvalue <- segments.p(CNV.object.smoothed.seg, ngrid = 100, tol = 1e-6, alpha = 0.05, search.range = 100, nperm = 1000)
  # remove all calls located on chrM and other unknown contigs
  seg.pvalue2 <- seg.pvalue[which(seg.pvalue[,2] %in% paste0("chr", c(1:22, "X", "Y"))),]
  seg.pvalue3 <- na.omit(seg.pvalue2)
  # remove the first column, which contains something like "sample.1" for all rows
  seg.pvalue4 <- seg.pvalue3[,-1]
  # the first column needs to be named "ID" with a distinct value per row, so create one ourselves
  seg.pvalue5 <- cbind(ID = seq(1, nrow(seg.pvalue4)), seg.pvalue4)
  # write it out with column names but without row names
  write.table(seg.pvalue5, file = paste0(dir, "/DNAcopy_segmented.raw.txt"), sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
}
#run_seg(dir = "~")
#run_seg(dir = "~")

*You can wrap it as an R function and apply it as I did; because I created a separate folder for each sample, the directory must be given per sample.
*The output DNAcopy_segmented.raw.txt file contains the segmented CNV events.

The DNAcopy_segmented.raw.txt file looks like this (first few rows):



ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945
2 chr1 69385 69585 3 -0.2673 4.96673605235722 1.72779807513072e-05 69585 69585
3 chr1 137789 733055 41 0.5602 7.76070062891583 5.55733108883229e-13 729580 736495
4 chr1 733155 939575 69 0.0991 3.84560576925794 0.00760174977634346 786557 953769
5 chr1 941075 1387435 908 0.2196 5.15737889167658 2.35801177726566e-05 1374631 1387843
6 chr1 1387535 1388694 9 0.6462 4.37943808685451 0.000412085514088083 1387943 1390674
7 chr1 1390133 1439397 87 0.2221 3.04328602314529 0.0828234619411378 1391306 1484973

* This information is very useful; the final output is based on it.


3. Use mergeSegments.pl to merge similar called events. However, after you download mergeSegments.pl, you will need to revise it a bit, as posted online:



I pasted the answer here:

1) add $stats{'num_merged_events'} in line 109:
$stats{'num_merged_events'} = $stats{'num_variants'} = $stats{'num_fail_pos'} = $stats{'num_fail_strand'} = $stats{'num_fail_varcount'} = $stats{'num_fail_varfreq'} = $stats{'num_fail_mmqs'} = $stats{'num_fail_var_mmqs'} = $stats{'num_fail_mapqual'} = $stats{'num_fail_readlen'} = $stats{'num_fail_dist3'} = $stats{'num_pass_filter'} = 0;
2) remove $sample in line 134:
my ($id, $chrom, $chr_start, $chr_stop, $num_mark, $seg_mean, $bstat, $p_value, $lcl, $ucl) = split(/\s+/, $line);
3) By the way, make sure the $ref_arm_sizes file is tab-delimited and contains "chr" in the first column if your chromosomes are prefixed with "chr".
The DNAcopy_segmented.raw.txt generated in the last step can be used here directly after you revise the Perl file.

On terminal:



perl mergeSegments_edited.pl  DNAcopy_segmented.raw.txt --ref-arm-sizes   chr_arm_size_hg38.txt  --output-basename  sample.output.seg

* Besides the Perl file, we also need a file specifying the reference chromosome-arm sizes. The file looks like this (first few rows):


chr1 1 123400000 p
chr1 123400000 248956422 q
chr2 1 93900000 p
chr2 93900000 242193529 q
chr3 1 90900000 p
chr3 90900000 198295559 q
chr4 1 50000000 p
chr4 50000000 190214555 q
chr5 1 48800000 p


*Notice: the ref_arm_size file needs to cover every chromosome present in your DNAcopy_segmented.raw.txt file. Because I only kept chr1-22, chrX, and chrY, my ref_arm_size.txt only contains those chromosomes (chrM and the other contigs were removed in advance). (As the authors note, these should not necessarily be removed, but their coverage can be so variable that calls on them may not be trustworthy.) A quick consistency check is sketched below.
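Since a missing chromosome in the arm-size file is an easy mistake to make, here is a small helper sketch (my own, not part of VarScan2) that checks that every chromosome in the segmented file is covered:

import csv

# chromosomes listed in the arm-size file (first column, tab-delimited)
with open("chr_arm_size_hg38.txt") as fh:
    arm_chroms = {row[0] for row in csv.reader(fh, delimiter="\t")}

# chromosomes present in the DNAcopy output (the "chrom" column)
with open("DNAcopy_segmented.raw.txt") as fh:
    rows = list(csv.reader(fh, delimiter="\t"))
idx = rows[0].index("chrom")
missing = {r[idx] for r in rows[1:]} - arm_chroms
print("chromosomes missing from arm-size file:", sorted(missing) or "none")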

The output will be two files: sample.output.seg.events.tsv and sample.output.seg.summary.tsv. The first file includes all the detailed events after merging the segments. The second file contains only one row, giving the number of events called (like a summary).

The sample.output.seg.events.tsv looks like this (a few rows):

chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%
chr1 69385 69585 4.96673605235722 1 -0.2673 69585 deletion 201 focal chr1p 0.00% 0.00%
chr1 137789 733055 7.76070062891583 1 0.5602 729580 amplification 595267 focal chr1p 0.48% 0.24%
chr1 941075 1387435 5.15737889167658 1 0.2196 1374631 neutral 446361 focal chr1p 0.36% 0.18%


The sample.output.seg.summary.tsv only has one row (tab-delimited):

segments merged_events amps large-scale focal dels large-scale focal amp_regions del_regions
3048 3181 1142 2 1140 826 823 3 chr9p,chr9q chr10p,chr10p,chr13q

*What you really need to be careful about is the column header of events.tsv. I think the order and labels are wrong (after comparing it with the DNAcopy_segmented.raw.txt file).

Here are the first two rows of events.tsv and DNAcopy_segmented.raw.txt, respectively:


chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%

ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945

The chrom, chr_start, and chr_stop columns are the same, but the seg_mean is different. I think the correct column names for events.tsv should be:

chrom chr_start chr_stop bstat merged_seg seg_mean lcl event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%

The "merged_seg" is the number of events in DNAcopynumber_segmented.raw.txt merged in this file. Some called events in DNAcopynumber_segmented.raw.txt were removed, some of them were merged as one call in the events.tsv file. 



##CNVkit (tutorial)


CNVkit is not covered by the two review papers I listed above. However, its performance was examined in a CCR paper that compared WXS+CNVkit against aCGH and FISH using pancreatic cancer biopsies. Since I am working on WXS of biopsy samples, I wanted to try this as well.



1. Installation

Because all my BAM files are stored on the HPC, I needed to install the package there. It is written in Python, so install Python via Anaconda or Miniconda first. I found a very useful tutorial (install Python on a local HPC account). After installing Python with Miniconda or Anaconda, install CNVkit as described here.

In general:

# Configure the sources where conda will find packages
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install into the existing environment
conda install cnvkit


* Just agree to the installation of all required packages.


2. Run CNVkit

The tutorial provides a one-shot "copy number calling pipeline" via cnvkit.py batch. However, I found the step-by-step version easier to follow.

Terminal:


cnvkit.py access baits.bed --fasta hg19.fa -o access.hg19.bed

*use the interval list to define the accessible regions vs. the inaccessible regions (the latter are called antitargets below)
*this generated two files in my run: baits.target.bed and baits.antitarget.bed. However, access.hg19.bed was not generated. (I am not sure why; per the CNVkit documentation, access takes the genome FASTA as its positional argument, e.g. cnvkit.py access hg19.fa -o access.hg19.bed, which may explain it. It did not affect the following steps.)

cnvkit.py autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names]

*I actually did not run this step, because my access.hg19.bed was not generated properly

 # For each sample... 
cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn 
cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn 

* For each BAM file (both normal and tumor), generate its target and antitarget coverage (.cnn) files.

 # With all normal samples...
 cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn 

*Use all the normal .cnn coverage files generated in the last step to build the reference file.

 # For each tumor sample... 
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr
cnvkit.py segment Sample.cnr -o Sample.cns

*Use the reference file to fix the tumor .cnn files, then do the segmentation.

 # Optionally, with --scatter and --diagram 
cnvkit.py scatter Sample.cnr -s Sample.cns -o Sample-scatter.pdf 
cnvkit.py diagram Sample.cnr -s Sample.cns -o Sample-diagram.pdf

*Draw the plots based on the .cnr and .cns files.

In fact, most steps are very fast and take only a few minutes. The whole process also follows the workflow figure shown in the tutorial.

Here is what the data look like for one of my samples:

sample.targetcoverage.cnn:

chromosome start end gene depth log2
chr1 12080 12251 - 191.018 7.57756
chr1 12595 12802 - 146.932 7.19901
chr1 13163 13658 - 263.309 8.04061
chr1 14620 15015 - 292.253 8.19107

sample.antitargetcoverage.cnn:

chromosome start end gene depth log2
chr1 150500 204324 Antitarget 17.7388 4.14884
chr1 204324 258148 Antitarget 1.32545 0.406482
chr1 259550 296887 Antitarget 0.844792 -0.243332
chr1 298007 347468 Antitarget 0 -20

sample.cnr:

chromosome start end gene log2 depth weight
chr1 12080 12251 - -0.0459285 191.018 0.503366
chr1 12595 12802 - 0.300996 146.932 0.505221
chr1 13163 13658 - 0.200901 263.309 0.606416

sample.cns:

chromosome start end gene log2 depth probes weight
chr1 12080 127570996 - -0.533832 88.0996 19188 9109.46
chr1 128799660 142787526 - -16.7501 5.35656e-05 47 21.9934
chr1 143165577 248937047 - -0.00995558 130.76 16375 7820.12
chr2 41449 93178706 - -0.0144545 114.778 10507 4877.72

The two plots (the diagram plot and the scatter plot) are not reproduced here.

##Comments

Generally speaking, CNVkit is faster and easier (no jumping from Linux to R and back to Linux). The results generated by VarScan2 and CNVkit are quite different:
1. VarScan2 usually generated 2k~3k events (some of them small), whereas CNVkit usually called <100 events.
2. VarScan2 can only call events in the regions covered by the capture kit, whereas CNVkit (if you follow my commands) covers all regions. (Of course, you can change this in CNVkit using -m hybrid/amplicon.)

