Some basic questions about cluster analysis.
1. Main approaches:
(1) Partitioning approach
--K-means
--K-medoids (also called Partitioning Around Medoids, PAM)
--Model-based approaches
(2) Hierarchical approach
--agglomerative (bottom-up: from single elements to one big group)
--divisive (top-down: divide from one big group down to single elements)
2. What is the difference between the partitioning approach and the hierarchical approach, and what are their advantages and disadvantages?
(1) Partitioning approach: divide n observations or samples into k sets. You must first specify how many subgroups you want (that is the k), then minimize the distance between the center of each subgroup and the points belonging to that group.
--Advantages: the number of groups is well defined; a clear, deterministic assignment of each object to a group; simple algorithms for inference*
--Disadvantages: you have to choose the number of groups; sometimes objects do not fit well into any cluster; it can converge on locally optimal solutions and often requires multiple restarts with random initializations*
(2) Hierarchical approach: no need to specify k; it joins or divides samples one at a time, finally forming a dendrogram.
--Advantages: small clusters can be nested inside large ones; no need to specify the number of groups ahead of time; flexible linkage methods
--Disadvantages: clusters might not be naturally represented by a hierarchical structure; it is necessary to 'cut' the dendrogram in order to produce clusters; bottom-up clustering can result in poor structure at the top of the tree, since early joins cannot be "undone"
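To make the contrast concrete, here is a minimal Python sketch running both approaches on the same toy data (scikit-learn is my assumption; the post itself does not name a library):

```python
# A minimal sketch contrasting partitioning vs. hierarchical clustering.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Partitioning: k must be chosen up front; the result depends on the random
# initialization, so n_init restarts the algorithm from several seeds.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hierarchical (agglomerative): builds the full merge tree; the "cut" into
# 3 clusters is only applied at the end.
hc = AgglomerativeClustering(n_clusters=3, linkage="average").fit(X)

print(km.labels_[:10])
print(hc.labels_[:10])
```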
3. K-means, K-medoids, and model-based approaches
The difference between K-means and K-medoids is that K-means uses the mean value of the cluster's members as the center of each cluster, while K-medoids uses an actual data point (the medoid) as the center of each cluster.
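A minimal sketch of this difference on a hypothetical toy cluster, where the mean is pulled toward an outlier but the medoid is not:

```python
# Mean (K-means center) vs. medoid (K-medoids center) on toy data.
import numpy as np

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.1], [8.0, 8.0]])  # one outlier

# K-means center: the arithmetic mean, which need not be an actual data point
# and is pulled toward the outlier.
mean_center = cluster.mean(axis=0)

# K-medoids center: the actual data point with the smallest total distance
# to all other points in the cluster.
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid_center = cluster[dists.sum(axis=1).argmin()]

print(mean_center)    # ~[2.83, 2.75], not a member of the cluster
print(medoid_center)  # [1.1, 1.1], an actual observation
```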
Model-based clustering is based on a probability distribution model (typically a mixture of distributions). It is usually not appropriate for non-Gaussian, high-dimensional, or large datasets. Houseman et al. proposed a recursively partitioned mixture model in 2008, developed for DNA methylation data analysis. The corresponding R package is "RPMM".
(for more information about the algorithm of model-based clustering analysis: https://www.stat.washington.edu/raftery/Research/PDF/fraley2002.pdf)
(more information about RPMM: http://www.biomedcentral.com/1471-2105/9/365)
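RPMM itself is an R package; as a rough Python illustration of model-based (mixture) clustering in general, not of Houseman's algorithm, here is a minimal Gaussian mixture sketch with scikit-learn:

```python
# Model-based clustering: fit a mixture of Gaussians, then read off soft
# (probabilistic) cluster memberships rather than hard assignments.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=100, centers=2, random_state=1)

gmm = GaussianMixture(n_components=2, random_state=1).fit(X)
print(gmm.predict(X)[:10])        # hard labels
print(gmm.predict_proba(X)[:3])   # posterior membership probabilities
```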
4. What matters in clustering analysis?
Partitioning methods: the distance metric used to compute the distance matrix; the number of groups
Hierarchical methods: the distance metric used to compute the distance matrix; the linkage method
5. Distance metrics: how to calculate the distance matrix
We treat each sample as a vector, so comparing two samples amounts to computing the distance between two vectors. We want to assign samples with small distances to the same cluster and separate samples with large distances into different clusters. (One method to evaluate the result is Ward's method in hierarchical clustering: http://en.wikipedia.org/wiki/Ward's_method.)
OK, so how can we calculate the distance between samples/vectors?
(1) Euclidean
--The most popular method in DNA methylation data clustering (the G-CIMP identifier paper: Noushmehr et al., 2010, Cancer Cell).
--Measures the geometric distance between two vectors, considering both direction and magnitude.
--Not appropriate when the absolute levels of functionally related genes are highly different.
--If the squared Euclidean distance is used, it places progressively greater weight on objects that are farther apart.
Assuming two vectors p = (p1, p2, p3, ..., pn) and q = (q1, q2, q3, ..., qn), the distance between p and q is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2).
(equation and more information: http://en.wikipedia.org/wiki/Euclidean_distance)
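A minimal Python sketch of that formula, checked against scipy (p and q are hypothetical vectors):

```python
# Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2)
import numpy as np
from scipy.spatial.distance import euclidean

p = np.array([2.0, 4.0, 6.0])
q = np.array([1.0, 1.0, 2.0])

d_manual = np.sqrt(np.sum((p - q) ** 2))
print(d_manual, euclidean(p, q))  # both ~5.099
```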
(2) Manhattan (taxicab) distance, a special case of the Minkowski distance
(equation, picture and more information: http://en.wikipedia.org/wiki/Taxicab_geometry)
(3) 1 - Pearson correlation coefficient
The Pearson correlation coefficient evaluates the tendency of two vectors to increase or decrease together; it ranges from -1 to 1. The closer |r| is to 1, the stronger the correlation; the closer to 0, the weaker.
--The assumption for the Pearson correlation coefficient is that the data follow a normal distribution (if this assumption is not satisfied, you could use the Spearman correlation coefficient, which uses the ranks of the data).
--It can only detect linear relationships and is sensitive to outliers.
(more information: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
(4) 1 - Spearman correlation coefficient
--Robust to outliers, but less sensitive (lower statistical power)
--Small sample sizes may produce false-positive results
--Not often used
(more information: http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
(5) Hamming distance
--Suitable for categorical data; it measures the number of positions at which two vectors differ.
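To make summary point (3) below concrete, here is a minimal scipy sketch computing the distances above for one hypothetical pair of vectors; note that scipy's correlation() is exactly 1 - Pearson, and its hamming() returns the fraction of differing positions:

```python
# Same pair of vectors, five different distance metrics, five different answers.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, correlation, hamming
from scipy.stats import spearmanr

p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([2.0, 4.0, 6.0, 9.0])

print("Euclidean:   ", euclidean(p, q))
print("Manhattan:   ", cityblock(p, q))            # taxicab / L1 distance
print("1 - Pearson: ", correlation(p, q))          # scipy's correlation distance
print("1 - Spearman:", 1 - spearmanr(p, q)[0])     # rank-based
print("Hamming:     ", hamming([0, 1, 1, 0], [0, 1, 0, 0]))  # categorical vectors
```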
Summary:
(1) A good picture to show the difference between Euclidean and Manhattan:
If p and q are both two-dimensional, the lines represent the distance between the two black dots: the red, blue, and yellow lines follow the Manhattan approach, while the green line is the Euclidean distance.
(2) Both 1 - Pearson and 1 - Spearman have the form 1 - correlation. For standardized data this is proportional to the squared Euclidean distance, but it is invariant to the range of measurement from one sample to the next*.
(3) Distance matters! Different metrics will give different results.
These three plots use the same data*.
As I mentioned before, the clustering process boils down to comparing vectors. A gene co-expression network likewise compares samples' gene expression profiles, picking out genes that show a co-expression trend. Just as in GSEA (gene set enrichment analysis), a 20% alteration across a gene set with a shared function can be more meaningful than a 200% change in a single gene, and it is also easier to interpret.
I think the topic of gene co-expression networks will help you understand the advantages and disadvantages of these algorithms better. Link: http://en.wikipedia.org/wiki/Gene_co-expression_network
6. Linkage methods (for hierarchical clustering)
--Single: uses the closest pair of points between two clusters.
--Complete: uses the farthest pair of points between two clusters.
--Centroid: uses the average value (centroid) of each small group.
--Average: uses the average distance over all pairs of data points.
Different linkage methods can largely affect the structure of the dendrogram.
These dendrograms use the same dataset*.
[Figures: dendrograms of the same data under single linkage, complete linkage, and centroid linkage.]
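A minimal scipy sketch that produces dendrograms like these on toy data (matplotlib assumed for plotting):

```python
# Same data, three linkage methods: the tree structure can change substantially.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))  # 15 samples, 4 features

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, method in zip(axes, ["single", "complete", "centroid"]):
    dendrogram(linkage(X, method=method), ax=ax)
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()
```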
7. Determining the number of groups (the k value)
--Silhouette coefficient: the higher the better. A positive value means the sample sits well within its cluster; a negative value means it may be closer to a "neighboring" cluster. Usually, we choose the samples with positive values as the "core samples". (A sketch follows this list.)
--BIC, AIC (information criteria for model-based clustering; lower is better)
--Dr. Roel G.W. Verhaak used SigClust to evaluate cluster significance. SigClust is designed for HDLSS data (high-dimensional, low sample size), e.g., gene expression microarray data. Each run compares two clusters and provides a p-value for that comparison; the null assumption is that the data come from a single normal distribution. (Verhaak, Roel GW, et al. "Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1." Cancer Cell 17.1 (2010): 98-110.)
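A minimal sketch of the first two criteria on toy data: silhouette scores across k for K-means, and BIC across k for a Gaussian mixture (this is an illustration, not SigClust):

```python
# Scan candidate k values with two criteria: silhouette (higher is better)
# and BIC (lower is better).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=150, centers=3, random_state=2)

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)
    bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
    print(f"k={k}: silhouette={sil:.3f}, BIC={bic:.1f}")
```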
* Credit should be given to Ralph Gottardo, FHCRC (http://www.rglab.org/); Boris Steipe, University of Toronto; Sohrab Shah, BC Cancer Agency; Canadian Bioinformatics Workshops (www.bioinformatics.ca)
Disclaimer: cited information has been marked, credits belong to the original author. All opinions are based on my personal understanding.