Monday, December 21, 2015

look into clustering analysis again

This is not a systematic discussion of clustering analysis; I just want to list some points.

The R package ConsensusClusterPlus is built mainly on this paper: http://download.springer.com/static/pdf/906/art%253A10.1023%252FA%253A1023949509487.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Farticle%2F10.1023%2FA%3A1023949509487&token2=exp=1450654590~acl=%2Fstatic%2Fpdf%2F906%2Fart%25253A10.1023%25252FA%25253A1023949509487.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Farticle%252F10.1023%252FA%253A1023949509487*~hmac=37ab0a4ee103f21b202ac85751eb5fd29bfc87dca1c95858a328275772f5268b ("Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data"). The paper's two main strengths are: a. resampling-based consensus clustering; b. data visualization.
The paper discusses the traditional clustering methods, such as hierarchical clustering (hc), k-means (km), PAM, and model-based clustering. Each method has its advantages and disadvantages. For example, hc cannot decide the number of clusters by itself, km is the most frequently used method, and PAM is something like an "advanced k-means" method.
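For reference, a minimal ConsensusClusterPlus call might look like the sketch below. The matrix `d` here is a random stand-in (the package expects features in rows and samples in columns), and the parameter values are only illustrative:

```r
library(ConsensusClusterPlus)  # Bioconductor package

# toy stand-in data: 100 features (rows) x 10 samples (columns)
set.seed(1)
d <- matrix(rnorm(1000), nrow = 100)

results <- ConsensusClusterPlus(d,
                                maxK = 6,              # evaluate k = 2..6
                                reps = 50,             # number of resampling runs
                                pItem = 0.8,           # subsample 80% of samples per run
                                clusterAlg = "pam",    # inner clustering algorithm
                                distance = "euclidean",
                                seed = 1,
                                plot = "png")          # writes consensus, CDF and delta-k plots
```

The CDF and delta-k plots mentioned below come out of this call's plot output, which is how the package helps you pick k.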


1. PAM uses real data points (medoids) from the dataset as cluster centers, while k-means uses computed centroids that need not be actual data points. PAM is more stable and more robust to outliers and noise.
2. The main issue in clustering analysis is determining the number of clusters that really exist in the dataset. In this paper, the authors provide a consensus method based on resampling: subsample and cluster the data many times, on the idea that the consensus across runs will be closer to the true structure. The empirical cumulative distribution function (CDF) of the consensus values is used to evaluate different k. As k increases, the area under the CDF always increases; however, by comparing the delta (the relative increase) between consecutive k, we can find the k that increases the CDF the most.
3. PAM could be the most useful method for clustering analysis, and the silhouette is usually recommended for evaluation. (Well, the package only provides the CDF plots, but you can always make the silhouette plot yourself.) As a reminder, when using pam() you need to load the cluster package, which is usually built in with R.
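For example, a quick pam() plus silhouette check might look like this; the data here are made up purely for illustration (two well-separated Gaussian groups):

```r
library(cluster)  # ships with base R distributions; provides pam() and silhouette()

set.seed(1)
# toy data: 20 points around (0,0) and 20 points around (5,5)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

fit <- pam(x, k = 2)              # k-medoids clustering
sil <- silhouette(fit)            # per-sample silhouette widths
mean(sil[, "sil_width"])          # average width; closer to 1 means tighter clusters
plot(sil)                         # the silhouette plot the package does not draw for you
```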
4. Then how about the distance calculation? Euclidean is the most frequently used distance. As I showed in a previous blog, we also have Manhattan, Spearman, Pearson, etc. Manhattan distance can only move horizontally and vertically, while Euclidean distance can go in a straight line.
For example, for point A (0,0) and point B (3,4), the Euclidean distance between A and B is 5, while the Manhattan distance is 3+4=7. The Spearman and Pearson methods are easy to understand: they are based on the correlation coefficient between every two points (usually converted to a distance such as 1 - correlation). Pearson generally needs to be applied to normally distributed data, and Spearman may have lower power to detect the relationship. I do not see any strong preference among these methods; it mainly depends on your data.
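You can check the numbers above directly in R; dist() expects the points as rows of a matrix:

```r
# two points: A = (0,0), B = (3,4)
ab <- rbind(A = c(0, 0), B = c(3, 4))

dist(ab, method = "euclidean")   # sqrt(3^2 + 4^2) = 5
dist(ab, method = "manhattan")   # |3 - 0| + |4 - 0| = 7
```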
5. When applying hc, remember to calculate the distance matrix first. dist() in R supports "euclidean", "maximum", "manhattan", "canberra", "binary", and "minkowski". daisy() (also from the cluster package) is often used to calculate pairwise distances as well.
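A minimal hc workflow on toy data might look like this (the data and the choice of k = 3 are just for illustration):

```r
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)      # toy data: 10 samples, 2 features

d  <- dist(x, method = "euclidean")   # step 1: pairwise distance matrix
hc <- hclust(d, method = "average")   # step 2: hierarchical clustering on the distances
plot(hc)                              # dendrogram
cl <- cutree(hc, k = 3)               # cut the tree into 3 clusters
cl                                    # one cluster label per sample
```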
6. However, pam() and kmeans() accept the original data matrix or data frame as input. In these two functions, columns are features and rows are samples, so the clustering happens on rows/samples. For expression matrices (genes in rows, samples in columns) you usually need to apply t() first.
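A sketch with a made-up expression-style matrix (gene and sample names are hypothetical):

```r
set.seed(1)
# expression-style matrix: 5 genes (rows) x 6 samples (columns)
expr <- matrix(rnorm(30), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:6)))

# kmeans() clusters rows, so transpose to cluster the samples instead of the genes
km <- kmeans(t(expr), centers = 2)
km$cluster   # one cluster label per sample
```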








JSON file I/O and GDC metadata download

What is JSON? JSON is short for JavaScript Object Notation. It is a lighter-weight format than XML, recording sample information in a shorter/simpl...