Tuesday, May 15, 2018

JSON file I/O and GDC metadata download

What is JSON?

JSON is short for JavaScript Object Notation. It is a lightweight alternative to XML: it records sample information in a shorter/simpler format, but its nested structure is similar to XML's.

Why JSON?

I came to this because the metadata downloaded from the GDC website is in JSON format, and I was trying to compare the samples downloaded from different data categories.

How to download data from GDC?

1. We can select the samples directly on the website, add them to the cart, and download them to the local computer. However, this is a bit slow, and if your destination is not the local computer, you will spend extra time transferring the files.
2. Use gdc-client to download the data set with a manifest file (gdc-client download -m manifest.txt). As long as the manifest file is provided, we can easily download from the terminal, directly onto the server.
*For Red Hat server users, special instructions are needed to install it.
*For controlled data, gdc-client needs an authorization token (passed with -t token.txt).

Here is the format of the manifest file (tab-delimited):

id filename md5 size state
a8f1ba67-bcef-45d1-b508-d4ce580d4362 TCGA-GV-A3JV-10B-01D-A221_120830_SN590_0178_AD143RACXX_s_2_rg.sorted.bam f6ee0cf24f786edb34743e1a68e0b4de 19824636256 live
c85e79ff-7813-4679-9c88-de2565ff4afa G32450.TCGA-FD-A3N5-10A-01D-A21A-08.4.bam afd2efa5c9634bd02cbf9bb4a68c5314 160495436671 live

We can see that, for each file, five fields are used to download the specified file. But how can we map samples downloaded from different data categories?

How to match samples from different data categories?

1. Select samples on the GDC portal.
2. Add the samples to the cart.
3. Go to the cart and download the metadata.
4. Extract the useful information from the metadata, such as the TCGA sample ID, tumor project, tumor type, filename, and id (which can be used to match against the manifest file).

Here I wrote a small Python script (read_json.py) that can be used to extract this information.

How to read a JSON file using Python?

A JSON file can be read into Python with the built-in json module (import json).
Note that json.load and json.loads are not interchangeable: the latter parses a "string", while the former reads from a file object.
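
A minimal example of the difference (the file name is just illustrative):

import json

# json.load takes a file object
with open("metadata.cart.json") as fh:
    meta = json.load(fh)          # -> list/dict parsed from the file

# json.loads takes a string (the trailing "s" is for "string")
record = json.loads('{"file_id": "abc", "state": "live"}')
print(record["state"])            # live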

You can download the read_json.py file above and use it to extract the needed information. Example:
python read_json.py --meta_file path/to/file.json  --output_file  path/to/output.txt --log_file path/to/log.txt

With this code, the following information will be extracted (a minimal sketch of the extraction logic is shown after the field descriptions below):

file_id;file_name;md5sum;entity_submitter_id;disease_type;primary_site;project;sample_type

file_id: can be used to match the id column in the manifest file.
file_name: can be used to match the filename column in the manifest file.
md5sum: can be used to match the md5 column in the manifest file.
entity_submitter_id: the TCGA sample barcode, such as TCGA-HT-7602-01A-21D-2087-05
disease_type: cancer type
primary_site: cancer site
project: such as TCGA-LGG
sample_type: primary solid tumor, normal, etc.
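
For reference, here is a minimal sketch of the extraction logic, assuming the cart metadata is a JSON array of file records; the exact nesting ("associated_entities", "cases", "samples") may differ between GDC data releases, so treat the key paths as assumptions and check your own file first:

import json

with open("metadata.cart.json") as fh:        # illustrative file name
    records = json.load(fh)

print(";".join(["file_id", "file_name", "md5sum", "entity_submitter_id",
                "disease_type", "primary_site", "project", "sample_type"]))
for rec in records:
    case = rec["cases"][0]                    # assumed nesting; verify first
    print(";".join([
        rec.get("file_id", "NA"),
        rec.get("file_name", "NA"),
        rec.get("md5sum", "NA"),
        rec["associated_entities"][0].get("entity_submitter_id", "NA"),
        case.get("disease_type", "NA"),
        case.get("primary_site", "NA"),
        case.get("project", {}).get("project_id", "NA"),
        case.get("samples", [{}])[0].get("sample_type", "NA"),
    ]))

The file_id printed here can then be matched against the id column of the manifest file.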




Monday, April 30, 2018

DetectCNV: VarScan2 and CNVkit (WXS)

Foreword:

We can use either WGS or WXS to detect CNV with many different algorithms. Generally speaking, WGS is more reliable, since WXS data are affected by the discontinuously covered regions and the enrichment bias of the capture kit (which captures the exome regions).


Some review papers have summarized these algorithms/packages and their advantages and disadvantages:
(1) Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives, Zhao et al. (2013) (for both WGS and WXS)
(2) An evaluation of copy number variation detection tools for cancer using whole exome sequencing data, Zare et al. (2017) (specific to WXS)

Main Content: 

I am working on some exome-seq data now, so I will focus on packages specific to WXS.


##ADTEx (source)


I did try this method; however, I met the same problem that has been posted online: the coverage file just keeps growing until the job is killed for "taking large space" (~1.3T).
The problem has been posted here: ADTEx, coverage issue. If you find the right answer, please post it.


##VarScan2



The VarScan2 workflow has five steps. However, the second step did not work for me (nothing was generated from this step with the command provided). Also, step 3, which uses DNAcopy (Bioconductor), has incorrect code. The last step, which needs mergeSegments.pl, requires revising some of the code as posted online: mergeSegments.pl revision (the last answer, very useful).

Here I provide my revised analysis based on that post.



1. Run samtools mpileup and VarScan copynumber to obtain the raw copy number output.



On terminal:

samtools mpileup -q 1 -f ref.fa normal.bam tumor.bam | java -jar \
VarScan.jar copynumber varScan --mpileup 1 --min-base-qual 20 \
--min-map-qual 20 --min-coverage 20 --min-segment-size 10 \
--max-segment-size 100 --p-value 0.01 --data-ratio $depth_ratio

* The $depth_ratio can be calculated as the average depth of the normal sample divided by the average depth of the tumor sample. It is used to adjust for the sequencing depth difference between tumor and normal. (2018-May-15 update: be careful with this, because it may bias the final results. For example, if the normal samples all have a higher read depth than the tumors, the tumors may come out as all deletion events.) A small sketch for computing the ratio follows these notes.
* For some reason (I do not know why), the output file always shows up as output.copynumber, so I suggest creating a separate folder for each sample to avoid one sample's output overwriting another's.
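
A minimal sketch for computing $depth_ratio, assuming you have first produced per-base depth files with something like samtools depth -a normal.bam > normal.depth (the file names and the averaging-over-all-positions choice are mine, not from the VarScan documentation):

def mean_depth(depth_file):
    # samtools depth output: chrom, pos, depth (tab-delimited)
    total, n = 0, 0
    with open(depth_file) as fh:
        for line in fh:
            total += int(line.rstrip("\n").split("\t")[2])
            n += 1
    return total / n if n else 0.0

ratio = mean_depth("normal.depth") / mean_depth("tumor.depth")
print("--data-ratio %.4f" % ratio)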

The output.copynumber file looks like:


$head output.copynumber 
chrom chr_start chr_stop num_positions normal_depth tumor_depth log2_ratio gc_content
chr1 13179 13278 100 43.7 58.9 0.323 55.0
chr1 13279 13378 100 65.7 106.5 0.589 60.0
chr1 13379 13398 20 28.5 49.0 0.674 50.0
chr1 14670 14716 47 25.3 58.0 1.087 63.8
chr1 16734 16833 100 64.0 97.4 0.498 58.0
chr1 16834 16915 82 39.6 50.2 0.237 59.8
chr1 17410 17509 100 42.2 76.0 0.739 68.0


2. Use DNAcopy (Bioconductor) to process the raw copy number calls. Code as below:


R code:

run_seg <- function(dir){
  library(DNAcopy)
  P1 <- read.table(paste0(dir, "/output.copynumber"), header = TRUE)
  CNV.object <- CNA(genomdat = P1[,7], chrom = P1[,1], maploc = P1[,2], data.type = "logratio")
  CNV.object.smoothed <- smooth.CNA(CNV.object)
  CNV.object.smoothed.seg <- segment(CNV.object.smoothed, verbose = 0, min.width = 2)
  seg.pvalue <- segments.p(CNV.object.smoothed.seg, ngrid = 100, tol = 1e-6, alpha = 0.05, search.range = 100, nperm = 1000)
  # remove all calls located on chrM and other unknown contigs
  seg.pvalue2 <- seg.pvalue[which(seg.pvalue[,2] %in% paste0("chr", c(1:22, "X", "Y"))),]
  seg.pvalue3 <- na.omit(seg.pvalue2)
  # remove the first column, which holds something like "Sample.1" for every row
  seg.pvalue4 <- seg.pvalue3[,-1]
  # mergeSegments.pl expects an "ID" column with a distinct value per row, so create one ourselves
  seg.pvalue5 <- cbind(ID = seq(1, nrow(seg.pvalue4)), seg.pvalue4)
  # write it out with column names but without row names
  write.table(seg.pvalue5, file = paste0(dir, "/DNAcopy_segmented.raw.txt"), sep = "\t", quote = FALSE, row.names = FALSE, col.names = TRUE)
}
# run_seg(dir = "~")
#run_seg(dir = "~")

*You can wrap it as an R function and apply it as I did. Because I created a separate folder for each sample, the directory has to be given per sample.
* The output file DNAcopy_segmented.raw.txt contains the segmented CNV events.

The DNAcopy_segmented.raw.txt file should look like this (first few rows):



ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945
2 chr1 69385 69585 3 -0.2673 4.96673605235722 1.72779807513072e-05 69585 69585
3 chr1 137789 733055 41 0.5602 7.76070062891583 5.55733108883229e-13 729580 736495
4 chr1 733155 939575 69 0.0991 3.84560576925794 0.00760174977634346 786557 953769
5 chr1 941075 1387435 908 0.2196 5.15737889167658 2.35801177726566e-05 1374631 1387843
6 chr1 1387535 1388694 9 0.6462 4.37943808685451 0.000412085514088083 1387943 1390674
7 chr1 1390133 1439397 87 0.2221 3.04328602314529 0.0828234619411378 1391306 1484973

* This information is very useful, and the final output is based on it.


3. Use mergeSegments.pl to merge similar called events. However, after you download mergeSegments.pl, you will need to revise it a bit as posted online:



I pasted the answer here:

1) add $stats{'num_merged_events'} in line 109:
$stats{'num_merged_events'} = $stats{'num_variants'} = $stats{'num_fail_pos'} = $stats{'num_fail_strand'} = $stats{'num_fail_varcount'} = $stats{'num_fail_varfreq'} = $stats{'num_fail_mmqs'} = $stats{'num_fail_var_mmqs'} = $stats{'num_fail_mapqual'} = $stats{'num_fail_readlen'} = $stats{'num_fail_dist3'} = $stats{'num_pass_filter'} = 0;
2) remove $sample in line 134:
my ($id, $chrom, $chr_start, $chr_stop, $num_mark, $seg_mean, $bstat, $p_value, $lcl, $ucl) = split(/\s+/, $line);
3) by the way, make sure the $ref_arm_sizes file is tab-delimited and contains "chr" in the first column if your chromosomes are prefixed with "chr".
The DNAcopy_segmented.raw.txt generated in the last step can be used here directly once you have revised the Perl file.

On terminal:



perl mergeSegments_edited.pl  DNAcopy_segmented.raw.txt --ref-arm-sizes   chr_arm_size_hg38.txt  --output-basename  sample.output.seg

* Besides the Perl script, we also need a file specifying the reference chromosome arm sizes. The file looks like this (first few rows; a sketch for building it yourself follows the example):


chr1 1 123400000 p
chr1 123400000 248956422 q
chr2 1 93900000 p
chr2 93900000 242193529 q
chr3 1 90900000 p
chr3 90900000 198295559 q
chr4 1 50000000 p
chr4 50000000 190214555 q
chr5 1 48800000 p
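
If you need to build this file yourself, here is a minimal sketch (my own helper, not part of VarScan2) that derives arm boundaries from UCSC's cytoBand.txt, whose tab-delimited columns are chrom, start, end, band (e.g. p36.33), and stain; the file names are illustrative:

from collections import defaultdict

# (chrom, arm) -> [min_start, max_end] over all bands of that arm
arms = defaultdict(lambda: [float("inf"), 0])
with open("cytoBand.txt") as fh:
    for line in fh:
        chrom, start, end, band, _stain = line.rstrip("\n").split("\t")
        key = (chrom, band[0])            # band[0] is 'p' or 'q'
        arms[key][0] = min(arms[key][0], int(start))
        arms[key][1] = max(arms[key][1], int(end))

with open("chr_arm_size.txt", "w") as out:
    for (chrom, arm), (start, end) in sorted(arms.items()):
        # cytoBand starts are 0-based; the arm file above starts at 1
        out.write("%s\t%d\t%d\t%s\n" % (chrom, max(start, 1), end, arm))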


*Notice: the ref_arm_size file needs to cover every chromosome present in your DNAcopy_segmented.raw.txt file. Because I only kept chr1-22, chrX, and chrY, my ref_arm_size.txt only has entries for those chromosomes (chrM and the other contigs were removed in advance). (The authors state these should not be removed in general, but their coverage can be so variable that the resulting calls may not be trustworthy.)

The output will be two files: sample.output.seg.events.tsv and sample.output.seg.summary.tsv. The first includes all the detailed events after merging the segments. The second contains only one row giving the numbers of events called (a summary).

The sample.output.seg.events.tsv looks like (few rows):

chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%
chr1 69385 69585 4.96673605235722 1 -0.2673 69585 deletion 201 focal chr1p 0.00% 0.00%
chr1 137789 733055 7.76070062891583 1 0.5602 729580 amplification 595267 focal chr1p 0.48% 0.24%
chr1 941075 1387435 5.15737889167658 1 0.2196 1374631 neutral 446361 focal chr1p 0.36% 0.18%


The sample.output.seg.summary.tsv only have one row (tab-delimited):

segments merged_events amps large-scale focal dels large-scale focal amp_regions del_regions
3048 3181 1142 2 1140 826 823 3 chr9p,chr9q chr10p,chr10p,chr13q

*What really needs attention is the column names of events.tsv. I think their order and labels are wrong (after comparing the file with DNAcopy_segmented.raw.txt).

Here are the first two rows of events.tsv and DNAcopy_segmented.raw.txt, respectively:


chrom chr_start chr_stop seg_mean num_segments num_markers p_value event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%

ID chrom loc.start loc.end num.mark seg.mean bstat pval lcl ucl
1 chr1 13179 65945 10 0.6309 5.82914009485987 5.28824152243196e-08 65945 65945

The chrom, chr_start, and chr_stop columns match, but the value under seg_mean does not. I think the correct column names for events.tsv should be:

chrom chr_start chr_stop bstat merged_seg seg_mean lcl event_type event_size size_class chrom_arm arm_fraction chrom_fraction
chr1 13179 65945 5.82914009485987 1 0.6309 65945 amplification 52767 focal chr1p 0.04% 0.02%

The "merged_seg" is the number of events in DNAcopynumber_segmented.raw.txt merged in this file. Some called events in DNAcopynumber_segmented.raw.txt were removed, some of them were merged as one call in the events.tsv file. 



###CNVkit (tutorial)


CNVkit is not covered by the two review papers posted above; however, its performance was assessed in a CCR paper that compared WXS+CNVkit against aCGH and FISH using pancreatic cancer biopsies. Since I am working on WXS of biopsy samples, I wanted to try it as well.



1. Installation

Because all my BAM files are stored on the HPC, I need to install the package there. It is written in Python, so install Python with Anaconda or Miniconda first; I found a very useful tutorial for that (install python on local HPC account). After installing Python with Miniconda or Anaconda, install CNVkit as described here.

In general:

# Configure the sources where conda will find packages 
conda config --add channels defaults 
conda config --add channels conda-forge 
conda config --add channels bioconda
# install into the existing environment
conda install cnvkit


* Just agree to the installation of all required packages.


2. Run CNVkit

The tutorial provides a one-shot "copy number calling pipeline" via cnvkit.py batch; however, I found the step-by-step version easier to follow.

Terminal:


cnvkit.py access baits.bed --fasta hg19.fa -o access.hg19.bed 

*Use the interval list to define the accessible and inaccessible regions (the latter are called "antitargets" below).
* In my run this generated two files, baits.target.bed and baits.antitarget.bed, but access.hg19.bed was never created. (Per the CNVkit documentation, access takes the genome FASTA as its positional argument, i.e. cnvkit.py access hg19.fa -o access.hg19.bed, which may explain this; in any case it did not affect the following steps.)

cnvkit.py autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names] 

*I did not actually run this step, because my access.hg19.bed was not generated properly.

 # For each sample... 
cnvkit.py coverage Sample.bam baits.target.bed -o Sample.targetcoverage.cnn 
cnvkit.py coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn 

* For each BAM file (both normal and tumor), generate its target and antitarget coverage files (.cnn).

 # With all normal samples...
 cnvkit.py reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn 

*Use all the normal coverage files generated in the last step to build the reference.

 # For each tumor sample... 
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr 
cnvkit.py segment Sample.cnr -o Sample.cns 

*Use the reference file to normalize ("fix") the tumor coverage, then run segmentation.

 # Optionally, with --scatter and --diagram 
cnvkit.py scatter Sample.cnr -s Sample.cns -o Sample-scatter.pdf 
cnvkit.py diagram Sample.cnr -s Sample.cns -o Sample-diagram.pdf

*Draw plots based on the .cnr and .cns files.

In fact, most steps are very fast, taking only a few minutes. The whole process also follows the figures shown in the tutorial.

Here is what the data look like for my sample:

sample.targetcoverage.cnn:

chromosome start end gene depth log2
chr1 12080 12251 - 191.018 7.57756
chr1 12595 12802 - 146.932 7.19901
chr1 13163 13658 - 263.309 8.04061
chr1 14620 15015 - 292.253 8.19107

sample.antitargetcoverage.cnn:

chromosome start end gene depth log2
chr1 150500 204324 Antitarget 17.7388 4.14884
chr1 204324 258148 Antitarget 1.32545 0.406482
chr1 259550 296887 Antitarget 0.844792 -0.243332
chr1 298007 347468 Antitarget 0 -20
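
A quick observation from the two .cnn excerpts above (my own sanity check, not something stated in the CNVkit docs): the log2 column of the raw coverage files appears to be simply log2 of the bin depth, with zero-depth bins floored at -20.

import math

print(math.log2(191.018))   # 7.5776 -> matches the first target bin
print(math.log2(17.7388))   # 4.1488 -> matches the first antitarget bin

In the .cnr file below, by contrast, log2 is the depth ratio relative to the reference, so values near 0 mean copy-neutral.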

sample.cnr:

chromosome start end gene log2 depth weight
chr1 12080 12251 - -0.0459285 191.018 0.503366
chr1 12595 12802 - 0.300996 146.932 0.505221
chr1 13163 13658 - 0.200901 263.309 0.606416

sample.cns:

chromosome start end gene log2 depth probes weight
chr1 12080 127570996 - -0.533832 88.0996 19188 9109.46
chr1 128799660 142787526 - -16.7501 5.35656e-05 47 21.9934
chr1 143165577 248937047 - -0.00995558 130.76 16375 7820.12
chr2 41449 93178706 - -0.0144545 114.778 10507 4877.72

The two plots, a diagram plot and a scatter plot, give a genome-wide overview of the calls (images not reproduced here).

###Comments

Generally speaking, CNVkit is faster and easier (no jumping from Linux to R and back to Linux). The results from VarScan2 and CNVkit are quite different:
1. VarScan2 usually generated 2k~3k events (some of them small), whereas CNVkit usually called <100 events.
2. VarScan2 can only call events in the regions covered by the capture kit, whereas CNVkit (if you follow my commands) covers all regions. (Of course, you can change this in CNVkit with -m hybrid/amplicon.)


Monday, December 21, 2015

look into clustering analysis again

This is not a systematic discussion of clustering analysis; I just want to list a few points.

The R package ConsensusClusterPlus is built mainly on this paper: "Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data" (Monti et al., Machine Learning, 2003). The paper's two main contributions are: (a) resampling-based consensus clustering; (b) data visualization.
The paper also reviews traditional clustering methods such as hierarchical clustering (hc), k-means (km), PAM, and model-based clustering. Each method has its advantages and disadvantages: for example, hc cannot decide the number of clusters by itself, km is the most frequently used method, and PAM is something like an "advanced k-means".


1. PAM uses actual data points in the dataset as cluster centers, whereas k-means can use generated (virtual) points. PAM is more stable and more robust to outliers and noise.
2. The main issue in clustering analysis is determining the number of clusters that really exist in the dataset. In this paper, the authors propose a resampling-based consensus method: resample and re-cluster the data many times, on the premise that the consensus across runs will be close to the true structure. The empirical cumulative distribution function (CDF) of the consensus values is used to evaluate each k. As k increases, the CDF area always increases, but by comparing the delta (the relative increase), we can find the k that increases the CDF the most.
3. PAM may be the most useful method for clustering analysis, and the silhouette width is usually recommended for evaluation. (The package only provides the CDF values, but you can always make the silhouette plot yourself.) As a reminder, pam() requires loading the cluster package, which usually ships with R.
4. What about the distance calculation? Euclidean is the most frequently used distance. As I showed in a previous blog post, there are also Manhattan, Spearman, Pearson, etc. Manhattan distance can only move horizontally and vertically, while Euclidean goes straight between the points.
For example, for point A (0,0) and point B (3,4), the Euclidean distance between A and B is 5 while the Manhattan distance is 3+4=7 (a small worked check follows this list). Spearman and Pearson are easy to understand: they compute the correlation coefficient between every pair of points. Pearson should usually be applied to normally distributed data, and Spearman may have lower power to evaluate the relationship. I do not see any strong preference among these methods; it mainly depends on your data.
5. When applying hc, remember to compute the distance matrix first. dist() in R supports "euclidean", "maximum", "manhattan", "canberra", "binary", and "minkowski". daisy() is also often used to compute pairwise distances.
6. However, pam() and kmeans() accept the original data matrix or data frame as input. In these two methods, columns are features and rows are samples, so the clustering happens on the rows/samples; you usually need to apply t() first.
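
Here is the worked check for the A=(0,0), B=(3,4) example from point 4, using scipy (the formulas themselves are one-liners; scipy just makes them explicit):

from scipy.spatial.distance import cityblock, euclidean

A, B = (0, 0), (3, 4)
print(euclidean(A, B))   # sqrt(3^2 + 4^2) = 5.0
print(cityblock(A, B))   # |3 - 0| + |4 - 0| = 7 (Manhattan distance)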








Friday, December 19, 2014

DNA methylation and CpG islands

1. What is DNA methylation?
DNA methylation is one epigenetic mechanism. Catalyzed by DNA methyltransferase (DNMT) enzymes, a methyl group is added to cytosine, converting it to methylcytosine; this usually happens at the carbon-5 position. Methylcytosine can spontaneously undergo deamination and become thymine.
2. What is CpG sites?
CpG sites stand for "-cytosine-phosphate-guanine-" in a linear sequence. They should be distinguished from CG base pairs.
3. What is CpG islands?
CpG islands are DNA regions with a high frequency of CpG sites. Methylation is usually rare within CpG islands. They have been investigated extensively because they are closely related to transcriptional activity.
4. Function of DNA methylation.
DNA methylation has been suggested to regulate gene expression since the 1980s: it can repress gene expression by locking down the gene. With further understanding of its function, more biological effects have been linked to DNA methylation, for example X chromosome inactivation, genomic imprinting, preservation of chromosome stability, and embryonic development.
With the application of next-generation sequencing (NGS) technologies, the investigation of DNA methylation is no longer restricted to CpG islands, especially those located in gene promoters; genome-wide DNA methylation profiles have become available. It is worth noting that not only hypermethylation but also hypomethylation has been shown to be related to disease, especially cancer. In fact, DNA methylation and cancer is a big topic that attracts a lot of attention.

Many reviews have been published on this topic:


13. Bird AP. CpG-rich islands and the function of DNA methylation. Nature. 1986;321(6067):209-213.

14. Phillips T. The role of methylation in gene expression. Nature Education. 2008;1(1):116.

15. Reik W. Stability and flexibility of epigenetic gene regulation in mammalian development. Nature. 2007;447(7143):425-432.

16. Li E, Beard C, Jaenisch R. Role for DNA methylation in genomic imprinting. Nature. 1993;366(6453):362-365.

17. Noushmehr H, Weisenberger DJ, Diefes K, et al. Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell. 2010;17(5):510-522. doi:10.1016/j.ccr.2010.03.017.

18. Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics. 2010;11(1):31-46.

19. Wilhelm-Benartzi CS, Koestler DC, Karagas MR, et al. Review of processing and analysis methods for DNA methylation array data. British Journal of Cancer. 2013;109(6):1394-1402.

20. Laird PW. Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics. 2010;11(3):191-203.

21. Bock C. Analysing and interpreting DNA methylation data. Nature Reviews Genetics. 2012;13(10):705-719.

22. Ehrlich M. DNA hypomethylation in cancer cells. Epigenomics. 2009;1(2):239-259.

23. Hon GC, Hawkins RD, Caballero OL, et al. Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome Research. 2012;22(2):246-258.

24. Ciriello G, Miller ML, Aksoy BA, Senbabaoglu Y, Schultz N, Sander C. Emerging landscape of oncogenic signatures across human cancers. Nature Genetics. 2013;45(10):1127-1133.

25. Ehrlich M. DNA methylation in cancer: too much, but also too little. Oncogene. 2002;21(35):5400-5413.

26. Xu Z, Taylor JA. Genome-wide age-related DNA methylation changes in blood and other tissues relate to histone modification, expression and cancer. Carcinogenesis. 2014;35(2):356-364.

27. Horvath S. DNA methylation age of human tissues and cell types. Genome Biology. 2013;14(10):R115.

28. Zykovich A, Hubbard A, Flynn JM, et al. Genome-wide DNA methylation changes with age in disease-free human skeletal muscle. Aging Cell. 2014;13(2):360-366.

29. Duncan CG, Barwick BG, Jin G, et al. A heterozygous IDH1R132H/WT mutation induces genome-wide alterations in DNA methylation. Genome Research. 2012;22(12):2339-2355.

30. Huse JT, Aldape KD. The evolving role of molecular markers in the diagnosis and management of diffuse glioma. Clinical Cancer Research. 2014;20(22):5601-5611.

 
