run_seg <- function(dir){
  library(DNAcopy)
  P1 <- read.table(paste0(dir, "/output.copynumber"), header = TRUE)
  CNV.object <- CNA(genomdat = P1[, 7], chrom = P1[, 1], maploc = P1[, 2], data.type = "logratio")
  CNV.object.smoothed <- smooth.CNA(CNV.object)
  CNV.object.smoothed.seg <- segment(CNV.object.smoothed, verbose = 0, min.width = 2)
  seg.pvalue <- segments.p(CNV.object.smoothed.seg, ngrid = 100, tol = 1e-6, alpha = 0.05, search.range = 100, nperm = 1000)
  # Keep only chr1-22, chrX and chrY; this removes all calls located on chrM and other unknown contigs.
  main.chroms <- paste0("chr", c(1:22, "X", "Y"))
  seg.pvalue2 <- seg.pvalue[which(seg.pvalue[, 2] %in% main.chroms), ]
  seg.pvalue3 <- na.omit(seg.pvalue2)
  # Remove the first column, which holds the same label (something like "sample.1") for every row.
  seg.pvalue4 <- seg.pvalue3[, -1]
  # The first column needs to be named "ID" and hold a different value in each row, so create one ourselves.
  seg.pvalue5 <- cbind(ID = seq_len(nrow(seg.pvalue4)), seg.pvalue4)
  # Write it out with column names but without row names.
  write.table(seg.pvalue5, file = paste0(dir, "/DNAcopy_segmented.raw.txt"), sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
}
#run_seg(dir = "~")
* You can wrap this as an R function and apply it per sample, as I wrote. Because I created a separate folder for each sample, the function takes each sample's directory as its argument (see the sketch after these notes).
* The output DNAcopy_segmented.raw.txt file contains the segmented CNV calls.
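For example, if each per-sample folder sits under a common parent directory, the function can be applied to all of them at once. A minimal sketch, assuming a made-up parent path ~/samples:
sample.dirs <- list.dirs("~/samples", recursive = FALSE)  # one folder per sample (hypothetical path)
invisible(lapply(sample.dirs, run_seg))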
The DNAcopy_segmented.raw.txt should look like this (first few rows):
ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945
2 chr1 69385 69585 3 -0.2673 4.96673605235722 1.72779807513072e-05 69585 69585
3 chr1 137789 733055 41 0.5602 7.76070062891583 5.55733108883229e-13 729580 736495
4 chr1 733155 939575 69 0.0991 3.84560576925794 0.00760174977634346 786557 953769
5 chr1 941075 1387435 908 0.2196 5.15737889167658 2.35801177726566e-05 1374631 1387843
6 chr1 1387535 1388694 9 0.6462 4.37943808685451 0.000412085514088083 1387943 1390674
7 chr1 1390133 1439397 87 0.2221 3.04328602314529 0.0828234619411378 1391306 1484973
* This information is very useful; the final output is based on it.
3. Use mergeSegments.pl to merge similar called events. However, after you download mergeSegments.pl, you will need to revise it a bit, as posted online:
I pasted the answer here:
1) Add $stats{'num_merged_events'} on line 109:
$stats{'num_merged_events'} = $stats{'num_variants'} = $stats{'num_fail_pos'} = $stats{'num_fail_strand'} = $stats{'num_fail_varcount'} = $stats{'num_fail_varfreq'} = $stats{'num_fail_mmqs'} = $stats{'num_fail_var_mmqs'} = $stats{'num_fail_mapqual'} = $stats{'num_fail_readlen'} = $stats{'num_fail_dist3'} = $stats{'num_pass_filter'} = 0;
2) Remove $sample on line 134:
my ($id, $chrom, $chr_start, $chr_stop, $num_mark, $seg_mean, $bstat, $p_value, $lcl, $ucl) = split(/\s+/, $line);
3) Also, make sure the $ref_arm_sizes file is tab-delimited and that its first column contains "chr" if your chromosome names carry the "chr" prefix.
The DNAcopy_segmented.raw.txt generated in the last step can be used here directly once you have revised the Perl file.
On the terminal:
perl mergeSegments_edited.pl DNAcopy_segmented.raw.txt --ref-arm-sizes chr_arm_size_hg38.txt --output-basename sample.output.seg
* Besides the Perl file, we also need a file specifying the reference chromosome-arm sizes. The file looks like this (first few rows):
chr1 1 123400000 p
chr1 123400000 248956422 q
chr2 1 93900000 p
chr2 93900000 242193529 q
chr3 1 90900000 p
chr3 90900000 198295559 q
chr4 1 50000000 p
chr4 50000000 190214555 q
chr5 1 48800000 p
* Notice: the ref_arm_size file needs to cover every chromosome present in your DNAcopy_segmented.raw.txt file. Because I only kept chr1-22, chrX and chrY, my ref_arm_size.txt only contains information for those chromosomes (chrM and the other contigs were removed in advance; as specified by the authors, those should not be removed, but their coverage may be so variable that the resulting calls are not trustworthy). A sketch for building this file from UCSC cytoBand data follows.
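If you don't want to assemble this file by hand, one way is to derive it from the UCSC cytoBand annotation. A minimal sketch in R, assuming cytoBand.txt.gz has been downloaded from the UCSC hg38 goldenPath; the output name chr_arm_size_hg38.txt matches the command above:
# Build a chromosome-arm table from UCSC cytoBand data (hg38).
cyto <- read.table("cytoBand.txt.gz", sep = "\t",
                   col.names = c("chrom", "start", "end", "band", "stain"))
cyto <- cyto[cyto$chrom %in% paste0("chr", c(1:22, "X", "Y")), ]
cyto$arm <- substr(cyto$band, 1, 1)  # first character of the band name: "p" or "q"
# Each arm spans from its smallest band start to its largest band end.
starts <- aggregate(start ~ chrom + arm, data = cyto, FUN = min)
ends <- aggregate(end ~ chrom + arm, data = cyto, FUN = max)
arms <- merge(starts, ends)
arms$start <- pmax(1, arms$start)  # cytoBand starts are 0-based; the file above starts at 1
arms <- arms[order(arms$chrom, arms$arm), c("chrom", "start", "end", "arm")]
write.table(arms, file = "chr_arm_size_hg38.txt", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)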
The output will be two files: sample.output.seg.events.tsv and sample.output.seg.summary.tsv. The first file contains the detailed events after merging the segments. The second contains only one row, giving the numbers of events called (a summary).
The sample.output.seg.events.tsv looks like this (first few rows):
chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%
chr1 69385 69585 4.96673605235722 1 -0.2673 69585 deletion 201 focal chr1p 0.00% 0.00%
chr1 137789 733055 7.76070062891583 1 0.5602 729580 amplification 595267 focal chr1p 0.48% 0.24%
chr1 941075 1387435 5.15737889167658 1 0.2196 1374631 neutral 446361 focal chr1p 0.36% 0.18%
The sample.output.seg.summary.tsv has only one row (tab-delimited):
segments merged_events amps large-scale focal dels large-scale focal amp_regions del_regions
3048 3181 1142 2 1140 826 823 3 chr9p,chr9q chr10p,chr10p,chr13q
* What you really need to be careful about is the column names of events.tsv. I think their order and labels are wrong (after comparing the file with DNAcopy_segmented.raw.txt).
Here are the first two rows of events.tsv and DNAcopy_segmented.raw.txt, respectively:
chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%
ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945
The chrom, chr_start and chr_stop values are the same, but the value labeled seg_mean is different (it actually matches bstat). I think the correct column names for events.tsv should be:
chrom chr_start chr_stop bstat merged_seg seg_mean lcl event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%
The "merged_seg" is the number of events in DNAcopynumber_segmented.raw.txt merged in this file. Some called events in DNAcopynumber_segmented.raw.txt were removed, some of them were merged as one call in the events.tsv file.
CNVkit is not covered by the two review papers I posted above. However, its performance was assessed in a CCR paper that compared WXS+CNVkit against aCGH and FISH using pancreatic cancer biopsies. Since I am working on WXS of biopsy samples, I would like to try it as well.
1. Installation
Because all my BAM files are stored on the HPC, I need to install the tool there. The package is written in Python, so Python needs to be installed with Anaconda or Miniconda first; I found one very useful tutorial (installing Python on a local HPC account). After installing Python with Miniconda or Anaconda, install CNVkit as described here.
In general:
# Configure the sources where conda will find packages
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install into the existing environment
conda install cnvkit
* Just agree to the installation of all required packages.
2. Run CNVkit
In this tutorial, they provide a "copy number calling pipeline" via cnvkit.py batch. However, I found the step-by-step version easier to follow.
Terminal:
cnvkit.py access hg19.fa -o access.hg19.bed
* this computes the sequencing-accessible regions of the reference genome; accessible regions outside the baits are called "antitarget" in the following steps
* in my run this produced two files, baits.target.bed and baits.antitarget.bed, but access.hg19.bed was not generated (I do not know why; it did not affect the following steps)
cnvkit.py autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names]
* I actually did not run this step, because my access.hg19.bed was not generated properly.
# For each sample...
cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn
cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn
* For each BAM file (both normal and tumor), compute the target and antitarget coverage (.cnn) files.
# With all normal samples...
cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn
* Use the coverage files of all the normal samples generated in the last step to build the reference file.
# For each tumor sample...
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr
cnvkit.py segment Sample.cnr -o Sample.cns
* Use the reference file to correct ("fix") the tumor coverage into a .cnr file, then run the segmentation.
# Optionally, with --scatter and --diagram
cnvkit.py scatter Sample.cnr -s Sample.cns -o Sample-scatter.pdf
cnvkit.py diagram Sample.cnr -s Sample.cns -o Sample-diagram.pdf
* Draw plots based on the .cnr and .cns files.
In fact, most steps are very fast and take only a few minutes. The whole process also follows the diagram shown in the tutorial.
Here I show how the data looks like in my sample:
sample.targetcoverage.cnn:
chromosome start end gene depth log2
chr1 12080 12251 - 191.018 7.57756
chr1 12595 12802 - 146.932 7.19901
chr1 13163 13658 - 263.309 8.04061
chr1 14620 15015 - 292.253 8.19107
sample.antitargetcoverage.cnn:
chromosome start end gene depth log2
chr1 150500 204324 Antitarget 17.7388 4.14884
chr1 204324 258148 Antitarget 1.32545 0.406482
chr1 259550 296887 Antitarget 0.844792 -0.243332
chr1 298007 347468 Antitarget 0 -20
sample.cnr:
chromosome start end gene log2 depth weight
chr1 12080 12251 - -0.0459285 191.018 0.503366
chr1 12595 12802 - 0.300996 146.932 0.505221
chr1 13163 13658 - 0.200901 263.309 0.606416
sample.cns:
chromosome start end gene log2 depth probes weight
chr1 12080 127570996 - -0.533832 88.0996 19188 9109.46
chr1 128799660 142787526 - -16.7501 5.35656e-05 47 21.9934
chr1 143165577 248937047 - -0.00995558 130.76 16375 7820.12
chr2 41449 93178706 - -0.0144545 114.778 10507 4877.72
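CNVkit provides cnvkit.py call for assigning discrete copy-number states, but for a quick look you can also read the .cns file into R. A minimal sketch; the log2 cutoffs of ±0.3 are my own arbitrary choice, not a CNVkit default:
# Quick summary of segmented calls; thresholds are arbitrary.
cns <- read.table("Sample.cns", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
cns$status <- ifelse(cns$log2 > 0.3, "gain",
              ifelse(cns$log2 < -0.3, "loss", "neutral"))
table(cns$status)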
The two plots: the diagram plot (Sample-diagram.pdf) and the scatter plot (Sample-scatter.pdf).
### Comments
Generally speaking, CNVkit is faster and easier (no jumping from Linux to R and back to Linux). The results generated by VarScan2 and CNVkit are quite different:
1. VarScan2 usually generated 2k~3k events (some of them small), while CNVkit usually called <100 events (see the counting sketch after this list).
2. VarScan2 can only call events in the regions covered by the capture kit, while CNVkit (if you follow my code) covers all regions. (Of course, you can change this in CNVkit with -m hybrid/amplicon.)
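To reproduce the event-count comparison in point 1, simply count the rows of the two outputs. A minimal sketch; the file names follow the examples in this post, and each row is counted as one event:
# Compare the number of events called by the two pipelines.
dnacopy.seg <- read.table("DNAcopy_segmented.raw.txt", header = TRUE, sep = "\t")
cnvkit.seg <- read.table("Sample.cns", header = TRUE, sep = "\t")
nrow(dnacopy.seg)  # VarScan2/DNAcopy route: typically 2k~3k in my samples
nrow(cnvkit.seg)   # CNVkit: typically <100 in my samples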