Tuesday, October 28, 2014

Share small R code

I would like to share some small but useful R codes and an online GISTIC tutorial. If code cited from somewhere, I marked them out.

1. Silhouette plot

#read the clusters into R, define the classification by yourself
cluster<-read.table("/Users/Documents/cluster.txt",header=T,row.names=1)

#colmun as features, row as samples
cluster<-t(cluster)

#calculate the dissimilarity and distance.
#The data has features in row, samples in column
library(cluster)
cluster.d<-daisy(t(data),metric="euclidean")

#calculate the silhouette
cluster.s<-silhouette(cluster,cluster.d)

#plot the silhouette
plot(cluster.s)

#write out the scores
write.table(cluster.s,file="/Users/Documents/cluster.s.txt",sep="\t",quote=F)

2. ConsensusClusterPlus-bioconductor R package

#it may take to sometime to install the package.
library(ConsensusClusterPlus)

#because there are many plots coming out, and the latter one will replace the former one (in my mac), so I set this, every time you need to "enter" to produce a new plot
op<-par(ask=TRUE)

#m is the data matrix with columns as samples, rows as features. maxK is the maximum K value, it will calculate from 2 to K. distance and cluster algorithm has more options, please go to tutorial link: http://www.bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
res=ConsensusClusterPlus(m,maxK=10,reps=1000,pItem=0.8,pFeature=1,title="kmeans_cluster",distance="euclidean",clusterAlg="km",seed=123455)

#this will show the class of all samples. [[3]] means when k=3, more option about the result also refer to the link shown above.
res[[3]][["consensusClass"]]

3.  R package-pheatmap

##how to use pheatmap--a powerful package drawing heat map

##1.define the annotation of each sample, add color bar to show the predefined clusters
##create a annotation for samples.
##a data frame with the following format
## Var1
##sample1 1
##sample2 1
##sample3 2
##sample4 2
##sample5 3
##sample6 3
##…

##assign the group to each sample. and read this file into R.

annotation=read.table("pathway/annotation.txt",header=T,row.names=1)
Var1=c("maroon3","olivedrab","orange","palevioletred","royalblue","seagreen","slateblue","tomato","steelblue","yellow4","red","orchid")
names(Var1)=c("M1","M2","M3","M4","M5","M6","M7","M8","M9","M10","M11","M12")
ann_colors=list(Var1=Var1)

##2. draw heat map
data format
probe sample1 sample2 sample3 sample4
probe1 v v v v
probe2 v v v v
probe3 v v v v



pheatmap(data,annotation=annotation,annotation_colors=ann_colors,cluster_col=FALSE,fontsize=5,width=40,height=10,filename="pathway/heatmap.png",annotation_legend=FALSE,drop_levels=FALSE)

##this will set the picture as a wider one then save it in the directory specified by filename.
##If not specified, both column and row will clustered by hierarchical clustering with distance "euclidean".

4. survival analysis
(code from website quick-R: http://www.statmethods.net/advstats/glm.html)
Here using the log rank for comparison.
(Both K-M curve and log rank test are based on the assumption proportional hazard. If the compared two groups are crossed, which will violate this assumption, you cannot use log rank test any more, no matter how small is your p-value. Then you can use the non-parametric method or multivariate regression model to control some covariates.)
1. data format
obs time status group
1 23 0 1
2 34 0 1
3 44 1 2
4 99 1 2


status: 0=censored(the event does not happen)
        1=happened

2. code

data=read.table()
library(survival)
survobj<-with(data,Surv(time,status))

#draw plot without group
fit0<-survfit(survobj~1,data=data)
plot(fit0,xlab="",ylab="",yscale=100,main="")

#draw plot by group, group1 will be red, group2 will be blue.
fit1<-survfit(survobj~group,data=data)
plot(fit1,xlab="",ylab="",col=c("red","blue"),main="")
legend("topright",title="",c("male","female"),fill=c("red","blue"))

#comparison
survdiff(survobj~group,data=data)

5. Online GISTIC guide

1. go to http://genepattern.broadinstitute.org/gp/, register if you are new. Enter your email and password to login.
2. choose the modules&pipelines---modules---browse modules---choose the software you will use. You can also drag the module into the favorite modules for convince.
3. For GISTIC, two required files: segmentation file and marker file.

For segmentation file: file.seg.txt, six column with tab-delimited file.

Sample Chromosome Start End Num_Probes Segment_Mean
S1 1 1002134 2301421 2 0.04
.
.

For marker file: file.txt, three column with tab-limited file.(without header is Ok)
SNP chromosome position
SNP1 1 1002134
SNP2 1 2301421
.
.

The first column is the SNP name, if could be assigned by yourself. The second column is the chromosome number. The third column is the position.
The marker file usually obtained from the file before segmentation. It should at least include all the start and end position in your file.seg.txt. If marker file is not available, you can create one based on the positions in your seg.txt.

4. Choose the GISTIC2.0 in the modules, and drag these two required files into the designed area, then press "run".
5. After the job completed, you will get a series result for each job. You can also downloaded them as a zip file or choose the one you are interested.













Monday, October 27, 2014

Statistical issues in DNA methylation analysis(cited from nature review genetics)

Some opinions about this issue:

How to apply bioinformatics and biostatistics methods to DNA methylation data is still a tricky question. In my view, none of them is perfect, because the data obtained cannot satisfy all assumptions. But we need to remember that different platforms will produce data with different features. So choose the methods based on your platform is important.

Another thing I want to point out is: can we apply methods developed for gene expression data to our DNA methylation data? Sorry, I do not know. As mentioned in papers, the scale and assumptions are different between DNA methylation data and gene expression data. Probably, if we use the M-value to present the methylation information rather than beta-value (most commonly used), we can get more closer data features to gene expression data. For example, M-value is more normal (even though not normal enough) than beta-value; it will change from finite scale(0 to 1) to infinite scale (although it will still restricted into -6 to +6).

The following paragraphs are cited from "principles and challenges of genome-wide DNA methylation analysis". Peter.W.Larid
full paper link: http://www.nature.com/nrg/journal/v11/n3/full/nrg2732.html

Box2 An introduction to statistical issues in DNA methylation analysis
DNA samples are generally derived from mixtures of cell populations with heterogeneous DNA methylation profiles. Bisulphite sequencing-based approaches can provide a discrete DNA methylation pattern corresponding to a single original DNA molecule. However, most techniques provide an average measurement across the sampled DNA molecules for a particular locus or CpG dinucleotide. For biological and historical reasons, this is usually expressed as a fraction or percentage methylation of the total molecules assessed. For most platforms, DNA methylation measurements represent absolute measurements for a given sample, whereas gene expression measurements are usually expressed as a differential comparison between samples. Usually, strand-specific DNA methylation or hemimethylation is not considered, although monoallelic methylation — including, but not restricted to, imprinted regions — does occur in vivo. The resulting measurement scale is therefore 0 to 1, or 0 to 100%, with 0 indicating that no methylated molecules were identified and 1 or 100% indicating that all identified molecules were methylated. The fraction is calculated as M/(M + U), in which M represents the signal for methylated molecules and U the signal for unmethylated molecules.

It is important to note that this is a finite scale (β distribution) that has different statistical properties to the infinite scale that is commonly used in gene expression array analysis. For example, β-distributed DNA methylation measurements are not normally distributed and the variance of measurements within a finite scale is influenced by the mean of the measurements; the variance of measurements with a mean near to the middle of the range can be much larger than the variance of measurements with a mean close to the limits (0 and 1). Therefore, sorting features (for example, genes or probes) by standard deviation will result in a bias towards features with mean methylation in the middle of the range. Reducing the number of features by selecting probes with high standard deviation or median absolute deviation is a common step in unsupervised analyses of microarray data. Therefore, variance-stabilizing data transformations or selection of probes based on a different metric should be considered. The behaviour of the distance metrics used to compare measurements across samples (such as fold-change, log-ratio or simple subtraction) is different for β-distributed fractions or percentages in DNA methylation measurements than for infinite ratio scales. This should be given careful consideration when selecting the most appropriate metric. Clustering and partitioning methods used to identify subgroups of samples with similar DNA methylation profiles are being developed for β-distributed DNA methylation data115.

An alternative method is to report the ratio of methylated to unmethylated molecules for a particular locus (M/U), usually as a log2(M/U) ratio62, 152. This has gained acceptance with the increased use of microarrays in DNA methylation analysis62, and has the advantage that many of the tools developed for gene expression data can be readily applied. However, several points should be considered. First, many data normalization methods for gene expression data assume that many genes are not expressed and that most genes are not differentially expressed; similar assumptions cannot be made for DNA methylation measurements. More importantly, M and U are generally not independent, which violates an assumption of many of the statistical approaches for gene expression microarrays. M and U are biologically inversely correlated, but many DNA methylation platforms show them to be positively correlated if signal strength is strongly influenced by genomic location or by probe sequence — that is, if M and U are both derived from either a strongly or weakly hybridizing region. Furthermore, the M/U ratio may be inappropriate for situations in which the platform is measuring across multiple CpG dinucleotides. For example, if two CpGs are being measured and each CpG is methylated at 10%, M/U will equal 0.1 if the platform assesses each CpG independently, but will equal (0.1 × 0.1)/(0.9 × 0.9) = 0.01 if the platform only registers methylation when both CpGs are methylated and the absence of methylation when both CpGs are unmethylated. As the number of locally grouped CpGs increases, these distortions become quite pronounced.

An important distinction between DNA methylation measurements and gene expression measurements is that the total amount of CpG methylation can differ substantially among samples. Therefore, normalization methods, such as quantile normalization and LOESS normalization (which is often applied to data represented on MA plots), that assume similar total signal across samples can remove real biological signal. It is important to note that the fluorescent dyes in the Infinium DNA methylation assay do not correspond to DNA methylation states. Therefore, normalization algorithms based on dye channel comparisons cannot be applied to Infinium DNA methylation data without modification.

DNA methylation bioinformatics, biostatistics and computational biology are fertile areas of research that are under rapid development. Table 3 lists several resources for DNA methylation data analysis that are currently available.

Friday, October 10, 2014

introduction of cluster analysis

Some basic questions about cluster analysis.
1. Main approaches:
(1). Partitioning approach
- K-means
-K-medoids (also called partition around medoids)
-Model Based approaches
(2). Hierarchical approach
-agglomerative (from single element to one big group)
-divisive (divide from one big group until single element)
2. What is the difference between partitioning approach and hierarchical approach, disadvantages and advantages?
(1) Partitioning approach: divide n observations or samples into k set. You should first specify how many subgroups you want, that is the k, then minimize the distance between the center of each subgroup and the points belong to that group.
        --Advantages: numer of groups is well defined; a clear deterministic assignment of an object to a group; simple algorithms for inference*
        --Disadvantages: have to choose the number of groups; sometimes objects do not fit well to any cluster; can converge on locally optimal solutions and often require multiple restarts with random initializations*
(2) Hierarchical approach: no need to specify k, it will joint or divide one sample by one sample, finally form a dendrogram.
      --Advantages: there may be small clusters nested inside large ones; no need to specify number groups ahead of time; flexible linkage methods
      --Disadvantages: clusters might not be naturally represented by a hierarchical structure; its necessary to 'cut' the dendrogram in order to produce clusters; bottom up clustering can result in poor structure at the top of the tree, early joins cannot be "undone"
3. K-means, K-medoids, Model-based approach
The difference between K-means and K-medoids is that K-means is using the mean value as the center of each cluster, while K-medoids is using the data point as the center of each cluster.
Model-based clustering is based on distribution model. Usually it is not appropriate for non-Gaussian, high dimensional or large dataset. Houseman suggested a recursive-partitioned mixture model in 2008, which is developed for DNA methylation data analysis. The corresponding R-package is "RPMM".
(for more information about the algorithm of model-based clustering analysis: https://www.stat.washington.edu/raftery/Research/PDF/fraley2002.pdf)
(more information about RPMM: http://www.biomedcentral.com/1471-2105/9/365)
4. What matters the clustering analysis?
Partitioning methods: the algorithm to calculate the distance of the matrix; number of groups
Hierarchical methods: the algorithm to calculate the distance of the matrix; the linkage methods
5. Algorithms to calculate the distance of matrix
We treat each sample as a vector. When comparing two samples, it equals to compare two vector's distance. We want to assign the samples with small distance into one cluster, and separate samples with large distance into different clusters. (one method to evaluate the result is Ward' method in hierarchical clustering http://en.wikipedia.org/wiki/Ward's_method).
Ok, so how can we calculate the distance between samples/vectors?
(1). Euclidean
--The most popular method in DNA methylation data clustering (GCIMP identifier paper, Noushmehr, 2010, Cancer Cell).
--measure geometric distance between two vectors, consider both direction and magnitude
--not appropriate when absolute levels of functionally genes are highly different
--If using the squared euclidean distance, it will place progressively greater weight on objects that are further apart.
Assuming two vector p(p1,p2,p3…pi), q(q1,q2,q3…qi), then the distance between p and q is:

(equation and more information: http://en.wikipedia.org/wiki/Euclidean_distance)
 (2) Manhattan, also called Minkowski

(equation, picture and more information: http://en.wikipedia.org/wiki/Taxicab_geometry)
(3) 1-pearson correlation coefficient
Pearson Correlation Coefficient is evaluating the tendency of two vectors to increase or decrease together, it ranges from -1 to 1. |Pearson Correlation Coefficient| closer to 1, means better correlation; closer to 0, means worse.
--The assumption for Pearson Correlation Coefficient is the data follow normal distribution (if not satisfy this assumption, could use Spearman correlation coefficient which is using the rank of the data)
--It can only detect linear relationship and is sensitive to outliers.
(4) 1- Spearman correlation coefficient
--Robust to outliers, but less sensitive
--Small sample size may provide false positive result
--not often use
(5) hamming distance
--suitable for categorical data
Summary:
(1) A good picture to show the difference between Euclidean and Manhattan:

If both p and q are two dimensions, lines represent the distance between the two black dots. The red, blue, yellow line are Manhattan approach, the green line is Euclidean.
(2) Both 1-Pearson Correlation Coefficient and 1- Spearman Correlation Coefficient is 1-correlation format. It is proportional to Euclidean distance, but invariant to range of measurement from one sample to the next sample*.
(3) Distance matters! Different algorithm will get different result.
 These three plots are using the same data*.










(4) How to choose the right algorithm?
As I mentioned before, the clustering process is similar to compare two vectors. Gene co-expression network is also comparing samples with different gene expression: picking out those genes which have co-expression trend. Just like GSEA(gene set enrichment analysis), the 20% alteration of a gene set with similar function has more meaning than a single gene changed 200%. It is also easier to interpret. 
I think the topic about gene co-expression network will help you understand the advantages and disadvantages about those algorithms better. linkage: http://en.wikipedia.org/wiki/Gene_co-expression_network
 6. Linkage method
--Single: using the closest point.
--Complete: using the furthest point.
--Centroids: using the average value for each small group.
--Average: using all data points.

Different linkage methods will largely affect the structure of dendrogram. 
These dendrograms are using the same dataset*.
Single linkage

complete linkage








centroids linkage


average linkage

7. Determine the number of groups (the k value)
--silhouette coefficient, the higher the better, positive value means the sample stayed within the cluster well, negative value means it may closer the the "neighbor cluster". Usually, we choose the samples with positive value as the "core samples".
--BIC, AIC
--Dr. Roel G.W. Verhaak using the SigClust to evaluate the cluster significance. SigClust is designed for HDLSS data, high-dimensional, low sample size data, eg. gene expression microarray data. Each time, it compares two cluster and provide the p-value for this comparison. The assumption is normal distribution data. (Verhaak, Roel GW, et al. "Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR and NF1." Cancer cell 17.1 (2010): 98-110.)


JSON file I/O and GDC metadata download

What is JSON? JSON is short for javascript object notation. It is a short format for the XML, record sample information in a shorter/simpl...