Tuesday, May 15, 2018

JSON file I/O and GDC metadata download

What is JSON?

JSON is short for JavaScript Object Notation. It is a lightweight alternative to XML: it records sample information in a shorter, simpler format, but the structure is hierarchical, similar to XML.
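For example, a record that might be stored as nested XML tags looks like this in JSON (a generic illustration, not actual GDC metadata):

{
  "file_name": "example.bam",
  "cases": [
    {"project": "TCGA-LGG", "sample_type": "Primary Tumor"}
  ]
}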

Why JSON?

The reason I looked into this is that the metadata downloaded from the GDC website is in JSON format, and I was trying to compare the samples downloaded from different data categories.

How to download data from GDC?

1. We can download directly by selecting the samples, adding them to the cart, and downloading to the local computer. However, this is a bit slow, and if your destination is not the local computer, you will spend extra time transferring the files.
2. Use gdc-client to download the data set using the manifest file (gdc-client download). In this case, as long as the manifest file is provided, we can easily download from the terminal directly to the server (see the example command below).
*For Red Hat server users, special instructions are needed to install it.
*For controlled data, an authorization token is needed for gdc-client.
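For example (assuming gdc-client is installed on the server; the token file is only required for controlled-access data, and both file names here are placeholders for your own):

gdc-client download -m gdc_manifest.txt -t gdc-user-token.txt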

Here is the format of the manifest file:

id filename md5 size state
a8f1ba67-bcef-45d1-b508-d4ce580d4362 TCGA-GV-A3JV-10B-01D-A221_120830_SN590_0178_AD143RACXX_s_2_rg.sorted.bam f6ee0cf24f786edb34743e1a68e0b4de 19824636256 live
c85e79ff-7813-4679-9c88-de2565ff4afa G32450.TCGA-FD-A3N5-10A-01D-A21A-08.4.bam afd2efa5c9634bd02cbf9bb4a68c5314 160495436671 live

We can see that, for each file, five fields are used to identify and download it. But how can we map the samples downloaded from different data categories?

How to match samples from different data categories?

1. select samples in the GDC data portal
2. add samples into the cart
3. go to the cart and download the metadata
4. extract the useful metadata, such as TCGA sample ID, tumor project, tumor type, filename, id (which can be used to match to the manifest file), etc.

Here I wrote a small Python script (read_json.py) that can be used to extract this information.

How to read a JSON file using Python?

A JSON file can be read into Python using: import json.
json.load and json.loads differ: the latter parses a JSON string, while the former reads from a file object.
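A minimal illustration of the difference:

import json

# json.loads parses a JSON-formatted string already in memory
meta = json.loads('{"file_id": "abc123", "md5sum": "f6ee0c"}')
print(meta["file_id"])  # -> abc123

# json.load reads and parses an open file object
with open("metadata.cart.json") as fh:  # hypothetical file name
    records = json.load(fh)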

You can download the read_json.py file above and use it to extract the needed information from the metadata file. Example:
python read_json.py --meta_file path/to/file.json  --output_file  path/to/output.txt --log_file path/to/log.txt

With this code, the following fields will be extracted:

file_id;file_name;md5sum;entity_submitter_id;disease_type;primary_site;project;sample_type

file_id: can be used to match the id in the manifest file.
file_name: can be used to match the filename in the manifest file.
md5sum: can be used to match the md5 in the manifest file.
entity_submitter_id: the TCGA sample barcode, such as TCGA-HT-7602-01A-21D-2087-05
disease_type: cancer type
primary_site: cancer site
project: such as TCGA-LGG
sample_type: solid primary tumor or normal, etc.
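If you prefer to roll your own extractor instead of using read_json.py, the sketch below shows the general idea. The exact nesting of the GDC cart metadata JSON can differ between data releases, so the key paths used here are assumptions to verify against your own file:

import json

# assumed layout: a list of file records, each carrying "associated_entities"
# and "cases" sub-records; verify these key paths against your own metadata
with open("metadata.cart.json") as fh:  # hypothetical file name
    records = json.load(fh)

for rec in records:
    case = rec["cases"][0]
    print("\t".join([
        rec["file_id"],                                        # matches manifest "id"
        rec["file_name"],                                      # matches manifest "filename"
        rec["md5sum"],                                         # matches manifest "md5"
        rec["associated_entities"][0]["entity_submitter_id"],  # TCGA barcode
        str(case.get("disease_type", "NA")),
        str(case.get("primary_site", "NA")),
        case["project"]["project_id"],                         # e.g., TCGA-LGG
        case["samples"][0]["sample_type"],                     # e.g., Primary Tumor
    ]))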




Monday, April 30, 2018

DetectCNV: VarScan2 and CNVkit (WXS)

Foreword:

We can use either WGS or WXS to detect CNV with many different algorithms. Generally speaking, WGS is more reliable, since WXS data are affected by the discontinuous covered regions and the enrichment bias of the capture kit (which captures the exome regions).


Some review papers have summarized the advantages and disadvantages of the algorithms/packages:
(1) Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives, Zhao et al. (2013) (for both WGS and WXS)
(2) An evaluation of copy number variation detection tools for cancer using whole exome sequencing data, Zare et al. (2017) (specific to WXS)

Main Content: 

I am working on some exome-seq data now, so I will focus on packages specific to WXS.


##ADTEx (source)


I did try this method, but I hit the same problem as posted online: the coverage file just keeps growing until the job is stopped for taking too much space (~1.3T).
The problem has been posted here: ADTEx, coverage issue. If you find the answer, please post it.


##VarScan2



We have five steps. However, the second step did not work for me (nothing was generated with the command provided). Also, step 3, which uses DNAcopy (Bioconductor), has incorrect code. The last step, which needs mergeSegments.pl, requires some code revisions as posted online: mergeSegments.pl revision (the last answer, very useful).

Here, I provided my revised analysis based on this post.



1. Run samtools mpileup and VarScan copynumber to obtain the raw copy-number output.



On terminal:

samtools mpileup -q 1 -f ref.fa normal.bam tumor.bam | java -jar \
VarScan.jar copynumber varScan --mpileup 1 --min-base-qual 20 \
--min-map-qual 20 --min-coverage 20 --min-segment-size 10 \
--max-segment-size 100 --p-value 0.01 --data-ratio $depth_ratio

* The $depth_ratio can be calculated by dividing the average depth of the normal sample by the average depth of the tumor sample. It adjusts for the sequencing-depth difference between tumor and normal (2018-May-15 update: be careful with this, because it may bias the final results. For example, if the normal consistently has higher read depth than the tumor, the tumor may come out as all deletion events.) A sketch for this calculation follows below.
* For some reason (I do not know why), the output file always shows up as output.copynumber, so I suggest creating a separate folder for each sample to avoid one sample's output overwriting another's.
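A minimal sketch for the $depth_ratio calculation (this assumes you have already run samtools depth -a on each BAM and saved the three-column chrom/pos/depth output; the file names are hypothetical):

# compute mean coverage from `samtools depth -a sample.bam > sample.depth.txt` output
def mean_depth(path):
    total, n = 0, 0
    with open(path) as fh:
        for line in fh:
            total += int(line.split("\t")[2])
            n += 1
    return total / n

normal = mean_depth("normal.depth.txt")  # hypothetical file names
tumor = mean_depth("tumor.depth.txt")
print("depth ratio (normal/tumor): %.4f" % (normal / tumor))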

The output.copynumber file looks like:


$head output.copynumber 
chrom chr_start chr_stop num_positions normal_depth tumor_depth log2_ratio gc_content
chr1 13179 13278 100 43.7 58.9 0.323 55.0
chr1 13279 13378 100 65.7 106.5 0.589 60.0
chr1 13379 13398 20 28.5 49.0 0.674 50.0
chr1 14670 14716 47 25.3 58.0 1.087 63.8
chr1 16734 16833 100 64.0 97.4 0.498 58.0
chr1 16834 16915 82 39.6 50.2 0.237 59.8
chr1 17410 17509 100 42.2 76.0 0.739 68.0


2. Use DNAcopy (Bioconductor) to process the raw copy-number calls. Code below:


R code:

run_seg <- function(dir){
  library(DNAcopy)
  P1 <- read.table(paste0(dir, "/output.copynumber"), header = TRUE)
  # build the CNA object from the log2 ratio (col 7), chromosome (col 1), and start position (col 2)
  CNV.object <- CNA(genomdat = P1[,7], chrom = P1[,1], maploc = P1[,2], data.type = "logratio")
  CNV.object.smoothed <- smooth.CNA(CNV.object)
  CNV.object.smoothed.seg <- segment(CNV.object.smoothed, verbose = 0, min.width = 2)
  seg.pvalue <- segments.p(CNV.object.smoothed.seg, ngrid = 100, tol = 1e-6, alpha = 0.05, search.range = 100, nperm = 1000)
  # remove all calls located on chrM and other unknown contigs
  seg.pvalue2 <- seg.pvalue[which(seg.pvalue[,2] %in% paste0("chr", c(1:22, "X", "Y"))),]
  seg.pvalue3 <- na.omit(seg.pvalue2)
  # remove the first column, which contains something like "sample.1" for all rows
  seg.pvalue4 <- seg.pvalue3[,-1]
  # the first column needs to be named "ID" with a distinct value per row, so create one ourselves
  seg.pvalue5 <- cbind(ID = seq(1, nrow(seg.pvalue4)), seg.pvalue4)
  # write it out with column names but without row names
  write.table(seg.pvalue5, file = paste0(dir, "/DNAcopy_segmented.raw.txt"), sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
}
#run_seg(dir = "~")
#run_seg(dir = "~")

*You can wrap it as an R function and apply it as I did; because I created a separate folder for each sample, the directory must be given per sample.
*The output DNAcopy_segmented.raw.txt file contains the segmented CNV events.

The DNAcopy_segmented.raw.txt file looks like this (first few rows):



ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945
2 chr1 69385 69585 3 -0.2673 4.96673605235722 1.72779807513072e-05 69585 69585
3 chr1 137789 733055 41 0.5602 7.76070062891583 5.55733108883229e-13 729580 736495
4 chr1 733155 939575 69 0.0991 3.84560576925794 0.00760174977634346 786557 953769
5 chr1 941075 1387435 908 0.2196 5.15737889167658 2.35801177726566e-05 1374631 1387843
6 chr1 1387535 1388694 9 0.6462 4.37943808685451 0.000412085514088083 1387943 1390674
7 chr1 1390133 1439397 87 0.2221 3.04328602314529 0.0828234619411378 1391306 1484973

* This information is very useful; the final output is based on it.


3. Use mergeSegments.pl to merge similar called events. However, after you download mergeSegments.pl, you will need to revise it a bit, as posted online:



I pasted the answer here:

1) add $stats{'num_merged_events'} in line 109:
$stats{'num_merged_events'} = $stats{'num_variants'} = $stats{'num_fail_pos'} = $stats{'num_fail_strand'} = $stats{'num_fail_varcount'} = $stats{'num_fail_varfreq'} = $stats{'num_fail_mmqs'} = $stats{'num_fail_var_mmqs'} = $stats{'num_fail_mapqual'} = $stats{'num_fail_readlen'} = $stats{'num_fail_dist3'} = $stats{'num_pass_filter'} = 0;
2) remove $sample in line 134:
my ($id, $chrom, $chr_start, $chr_stop, $num_mark, $seg_mean, $bstat, $p_value, $lcl, $ucl) = split(/\s+/, $line);
3) By the way, make sure the $ref_arm_sizes file is tab-delimited and contains "chr" in the first column if your chromosomes are prefixed with "chr".
The DNAcopy_segmented.raw.txt generated in the last step can be used here directly after you revise the Perl file.

On terminal:



perl mergeSegments_edited.pl  DNAcopy_segmented.raw.txt --ref-arm-sizes   chr_arm_size_hg38.txt  --output-basename  sample.output.seg

* Besides the Perl file, we also need a file specifying the reference chromosome-arm sizes. The file looks like this (first few rows):


chr1 1 123400000 p
chr1 123400000 248956422 q
chr2 1 93900000 p
chr2 93900000 242193529 q
chr3 1 90900000 p
chr3 90900000 198295559 q
chr4 1 50000000 p
chr4 50000000 190214555 q
chr5 1 48800000 p


*Notice: the ref_arm_size file needs to cover every chromosome present in your DNAcopy_segmented.raw.txt file. Because I only kept chr1-22, chrX, and chrY, my ref_arm_size.txt only contains those chromosomes (chrM and the other contigs were removed in advance). (As the authors note, these should not necessarily be removed, but their coverage can be so variable that calls on them may not be trustworthy.) A quick consistency check is sketched below.
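Since a missing chromosome in the arm-size file is an easy mistake to make, here is a small helper sketch (my own, not part of VarScan2) that checks that every chromosome in the segmented file is covered:

import csv

# chromosomes listed in the arm-size file (first column, tab-delimited)
with open("chr_arm_size_hg38.txt") as fh:
    arm_chroms = {row[0] for row in csv.reader(fh, delimiter="\t")}

# chromosomes present in the DNAcopy output (the "chrom" column)
with open("DNAcopy_segmented.raw.txt") as fh:
    rows = list(csv.reader(fh, delimiter="\t"))
idx = rows[0].index("chrom")
missing = {r[idx] for r in rows[1:]} - arm_chroms
print("chromosomes missing from arm-size file:", sorted(missing) or "none")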

The output will be two files: sample.output.seg.events.tsv and sample.output.seg.summary.tsv. The first file includes all the detailed events after merging the segments. The second file contains only one row, giving the number of events called (like a summary).

The sample.output.seg.events.tsv looks like this (a few rows):

chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%
chr1 69385 69585 4.96673605235722 1 -0.2673 69585 deletion 201 focal chr1p 0.00% 0.00%
chr1 137789 733055 7.76070062891583 1 0.5602 729580 amplification 595267 focal chr1p 0.48% 0.24%
chr1 941075 1387435 5.15737889167658 1 0.2196 1374631 neutral 446361 focal chr1p 0.36% 0.18%


The sample.output.seg.summary.tsv only has one row (tab-delimited):

segments merged_events amps large-scale focal dels large-scale focal amp_regions del_regions
3048 3181 1142 2 1140 826 823 3 chr9p,chr9q chr10p,chr10p,chr13q

*What you really need to be careful about is the column header of events.tsv. I think the order and labels are wrong (after comparing it with the DNAcopy_segmented.raw.txt file).

Here are the first two rows of events.tsv and DNAcopy_segmented.raw.txt, respectively:


chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%

ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945

The chrom, chr_start, and chr_stop columns are the same, but the seg_mean is different. I think the correct column names for events.tsv should be:

chrom chr_start chr_stop bstat merged_seg seg_mean lcl event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%

The "merged_seg" is the number of events in DNAcopynumber_segmented.raw.txt merged in this file. Some called events in DNAcopynumber_segmented.raw.txt were removed, some of them were merged as one call in the events.tsv file. 



##CNVkit (tutorial)


CNVkit is not covered by the two review papers I listed above. However, its performance was examined in a CCR paper that compared WXS+CNVkit against aCGH and FISH using pancreatic cancer biopsies. Since I am working on WXS of biopsy samples, I wanted to try this as well.



1. Installation

Because all my BAM files are stored on the HPC, I needed to install the package there. It is written in Python, so install Python via Anaconda or Miniconda first. I found a very useful tutorial (install Python on a local HPC account). After installing Python with Miniconda or Anaconda, install CNVkit as described here.

In general:

# Configure the sources where conda will find packages
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install into the existing environment
conda install cnvkit


* Just agree to the installation of all required packages.


2. Run CNVkit

The tutorial provides a one-shot "copy number calling pipeline" via cnvkit.py batch. However, I found the step-by-step version easier to follow.

Terminal:


cnvkit.py access baits.bed --fasta hg19.fa -o access.hg19.bed

*use the interval list to define the accessible regions vs. the inaccessible regions (the latter are called antitargets below)
*this generated two files in my run: baits.target.bed and baits.antitarget.bed. However, access.hg19.bed was not generated. (I am not sure why; per the CNVkit documentation, access takes the genome FASTA as its positional argument, e.g. cnvkit.py access hg19.fa -o access.hg19.bed, which may explain it. It did not affect the following steps.)

cnvkit.py autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names]

*I actually did not run this step, because my access.hg19.bed was not generated properly

 # For each sample... 
cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn 
cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn 

* For each BAM file (both normal and tumor), generate its target and antitarget coverage (.cnn) files.

 # With all normal samples...
 cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn 

*Use all the normal .cnn coverage files generated in the last step to build the reference file.

 # For each tumor sample... 
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr
cnvkit.py segment Sample.cnr -o Sample.cns

*Use the reference file to fix the tumor .cnn files, then do the segmentation.

 # Optionally, with --scatter and --diagram 
cnvkit.py scatter Sample.cnr -s Sample.cns -o Sample-scatter.pdf 
cnvkit.py diagram Sample.cnr -s Sample.cns -o Sample-diagram.pdf

*Draw the plots based on the .cnr and .cns files.

In fact, most steps are very fast and take only a few minutes. The whole process also follows the workflow figure shown in the tutorial.

Here is what the data look like for one of my samples:

sample.targetcoverage.cnn:

chromosome start end gene depth log2
chr1 12080 12251 - 191.018 7.57756
chr1 12595 12802 - 146.932 7.19901
chr1 13163 13658 - 263.309 8.04061
chr1 14620 15015 - 292.253 8.19107

sample.antitargetcoverage.cnn:

chromosome start end gene depth log2
chr1 150500 204324 Antitarget 17.7388 4.14884
chr1 204324 258148 Antitarget 1.32545 0.406482
chr1 259550 296887 Antitarget 0.844792 -0.243332
chr1 298007 347468 Antitarget 0 -20

sample.cnr:

chromosome start end gene log2 depth weight
chr1 12080 12251 - -0.0459285 191.018 0.503366
chr1 12595 12802 - 0.300996 146.932 0.505221
chr1 13163 13658 - 0.200901 263.309 0.606416

sample.cns:

chromosome start end gene log2 depth probes weight
chr1 12080 127570996 - -0.533832 88.0996 19188 9109.46
chr1 128799660 142787526 - -16.7501 5.35656e-05 47 21.9934
chr1 143165577 248937047 - -0.00995558 130.76 16375 7820.12
chr2 41449 93178706 - -0.0144545 114.778 10507 4877.72

The two plots (the diagram plot and the scatter plot) are not reproduced here.

##Comments

Generally speaking, CNVkit is faster and easier (no jumping from Linux to R and back to Linux). The results generated by VarScan2 and CNVkit are quite different:
1. VarScan2 usually generated 2k~3k events (some of them small), whereas CNVkit usually called <100 events.
2. VarScan2 can only call events in the regions covered by the capture kit, whereas CNVkit (if you follow my commands) covers all regions. (Of course, you can change this in CNVkit using -m hybrid/amplicon.)

