Computational Biology: 2014

Friday, December 19, 2014

DNA methylation and CpG islands

1. What is DNA methylation?
DNA methylation is one mechanism of epigenetic. Catalyzed by DNA methyltransferase (DNMT) enzyme, one methyl group has been added to the cytosine which convert the cytosine to methylcytosine. It usually happens at the carbon 5 position. Usually, the methylcytosine will spontaneous process the deamination and become thymine.
2. What is CpG sites?
CpG sites represent for "-cytosine-phosphate-guanine" in a linear sequence. It should be distinguished from CG as base pair.
3. What is CpG islands?
CpG islands are DNA regions with high frequency of CpG sites. Usually, methylation is rare in CpG islands. It has been largely investigated because its closely related to transcriptional activity.
4. Function of DNA methylation.
DNA methylation has suggested being able to adjust gene expression since 1980s. DNA methylation could repress the gene expression by locking the gene. However, with further understanding of its function, more biological effect has been related to DNA methylation. For example, X chromosome inactivation, genomic imprinting, preservation of chromosome stability, and embryonic development.
With the application of next generation sequencing (NGS) technologies, investigation of DNA methylation is no longer restricted to the CpG island, especially CpG island located in the gene promoter. Genome-wide DNA methylation profile become available for us. It is worth noting that not only hypermethylation but also hypomethylation has been proved to be related to diseases, especially cancer. In fact, DNA methylation and cancer is a big topic which attract large attentions.

Many review has been published on this topic:

13.          Bird AP. CpG-rich islands and the function of DNA methylation. Nature. 1985;321(6067):209-213.

14.          Phillips T. The role of methylation in gene expression. Nature Education. 2008;1(1):116.

15.          Reik W. Stability and flexibility of epigenetic gene regulation in mammalian development. Nature. May 24 2007;447(7143):425-432.

16.          Li E, Beard C, Jaenisch R. Role for DNA methylation in genomic imprinting. Nature. 1993;366(6453):362-365.

17.          Noushmehr H, Weisenberger DJ, Diefes K, et al. Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell 2010. 2010;17(5):510-522. doi: 510.1016/j.ccr.2010.1003.1017. Epub 2010 Apr 1015.

18.          Metzker ML. Sequencing technologies—the next generation. Nature Reviews Genetics. 2009;11(1):31-46.

19.          Wilhelm-Benartzi CS, Koestler DC, Karagas MR, et al. Review of processing and analysis methods for DNA methylation array data. British journal of cancer. Sep 17 2013;109(6):1394-1402.

20.          Laird PW. Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics. 2010;11(3):191-203.

21.          Bock C. Analysing and interpreting DNA methylation data. Nature reviews. Genetics. Oct 2012;13(10):705-719.

22.          Ehrlich M. DNA hypomethylation in cancer cells. Epigenomics. Dec 2009;1(2):239-259.

23.          Hon GC, Hawkins RD, Caballero OL, et al. Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome research. Feb 2012;22(2):246-258.

24.          Ciriello G, Miller ML, Aksoy BA, Senbabaoglu Y, Schultz N, Sander C. Emerging landscape of oncogenic signatures across human cancers. Nature genetics. Sep 26 2013;45(10):1127-1133.

25.          Ehrlich M. DNA methylation in cancer: too much, but also too little. Oncogene. Aug 12 2002;21(35):5400-5413.

26.          Xu Z, Taylor JA. Genome-wide age-related DNA methylation changes in blood and other tissues relate to histone modification, expression and cancer. Carcinogenesis. Feb 2014;35(2):356-364.

27.          Horvath S. DNA methylation age of human tissues and cell types. Genome biology. 2013;14(10):R115.

28.          Zykovich A, Hubbard A, Flynn JM, et al. Genome-wide DNA methylation changes with age in disease-free human skeletal muscle. Aging cell. Apr 2014;13(2):360-366.

29.          Duncan CG, Barwick BG, Jin G, et al. A heterozygous IDH1R132H/WT mutation induces genome-wide alterations in DNA methylation. Genome research. Dec 2012;22(12):2339-2355.

30.          Huse JT, Aldape KD. The evolving role of molecular markers in the diagnosis and management of diffuse glioma. Clinical cancer research : an official journal of the American Association for Cancer Research. Nov 15 2014;20(22):5601-5611.

Monday, November 24, 2014

python_set_functions

Table from: https://docs.python.org/2/library/sets.html

Operation	Equivalent	Result
`len(s)`		cardinality of set s
`x in s`		test x for membership in s
`x not in s`		test x for non-membership in s
`s.issubset(t)`	`s <= t`	test whether every element in s is in t
`s.issuperset(t)`	`s >= t`	test whether every element in t is in s
`s.union(t)`	`s \| t`	new set with elements from both s and t
`s.intersection(t)`	`s & t`	new set with elements common to s and t
`s.difference(t)`	`s - t`	new set with elements in s but not in t
`s.symmetric_difference(t)`	`s ^ t`	new set with elements in either s or t but not both
`s.copy()`		new set with a shallow copy of s

Monday, November 17, 2014

Glioma classification (English and Chinese version)

http://www.abta.org/brain-tumor-information/types-of-tumors/glioblastoma.html
http://www.abta.org/brain-tumor-information/types-of-tumors/glioma.html

Glioma胶质瘤
-from glial cell.

(1)glial cell 神经胶质细胞 (support tissue and keep them in position and function normally) include:
1. astrocyte 星形胶质细胞(astrocytomas星形细胞瘤, including glioblastoma/GBM胶质母细胞瘤)
2. oligodendrocyte少突胶质细胞 (oligodendrogliomas少突胶质细胞瘤)
3. ependymal cells 室管膜细胞(ependymomas室管膜瘤)
-mixed gliomas: mixture of these different cells
-Tumors like "optic nerve glioma" or "brain stem glioma" are named for location, not their tissue origin.

(2) astrocytoma has four grades (grading relies on evaluation of anaplasia, dedifferentiation and malignancy):
1. juvenile pilocytic astrocytoma毛细胞型星形细胞瘤
2. low grade astrocytoma=diffuse astrocytomas 低恶性胶质细胞瘤
3. anaplastic astrocytoma 未分化胶质细胞瘤
4. glioblastoma 胶质母细胞瘤

(3) ependymoma

(4) mixed glioma (=oligoastrocytoma): having a high proportion of more than one type of cell usually astrocytes and oligodendrocytes

(5) optic glioma: involved part of the optic pathway

(6) gliomatosis cerebri: widespread glial tumor cells in brain.

Glioblastoma胶质母细胞瘤

GBM: anaplastic, cellular gliomas composed of poorly differentiated, often pleomorphic astrocytic tumor cells with marked nuclear atypia and brisk mitotic activity.
-arise from astrocyte
-highly malignant
-contain mixed of cell types

(1) primary: most common, tend to form and make their presence known quickly.
(2) secondary: longer, slower growth history, but still very aggressive, may begin as lower grade but eventually higher grade, more found in 45s or younger, consist 10% of GBM

GBM take account of 17% primary brain tumor, 60-75% astrocytomas.

survival time: median-4.6 month, 2 year survival-30%.
MGMT methylated have longer survival time.

Tuesday, October 28, 2014

Share small R code

I would like to share some small but useful R codes and an online GISTIC tutorial. If code cited from somewhere, I marked them out.

1. Silhouette plot

#read the clusters into R, define the classification by yourself
cluster<-read.table("/Users/Documents/cluster.txt",header=T,row.names=1)

#colmun as features, row as samples
cluster<-t(cluster)

#calculate the dissimilarity and distance.
#The data has features in row, samples in column
library(cluster)
cluster.d<-daisy(t(data),metric="euclidean")

#calculate the silhouette
cluster.s<-silhouette(cluster,cluster.d)

#plot the silhouette
plot(cluster.s)

#write out the scores
write.table(cluster.s,file="/Users/Documents/cluster.s.txt",sep="\t",quote=F)

2. ConsensusClusterPlus-bioconductor R package

#it may take to sometime to install the package.
library(ConsensusClusterPlus)

#because there are many plots coming out, and the latter one will replace the former one (in my mac), so I set this, every time you need to "enter" to produce a new plot
op<-par(ask=TRUE)

#m is the data matrix with columns as samples, rows as features. maxK is the maximum K value, it will calculate from 2 to K. distance and cluster algorithm has more options, please go to tutorial link: http://www.bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
res=ConsensusClusterPlus(m,maxK=10,reps=1000,pItem=0.8,pFeature=1,title="kmeans_cluster",distance="euclidean",clusterAlg="km",seed=123455)

#this will show the class of all samples. [[3]] means when k=3, more option about the result also refer to the link shown above.
res[[3]][["consensusClass"]]

3. R package-pheatmap

##how to use pheatmap--a powerful package drawing heat map

##1.define the annotation of each sample, add color bar to show the predefined clusters
##create a annotation for samples.
##a data frame with the following format
## Var1
##sample1 1
##sample2 1
##sample3 2
##sample4 2
##sample5 3
##sample6 3
##…

##assign the group to each sample. and read this file into R.

annotation=read.table("pathway/annotation.txt",header=T,row.names=1)
Var1=c("maroon3","olivedrab","orange","palevioletred","royalblue","seagreen","slateblue","tomato","steelblue","yellow4","red","orchid")
names(Var1)=c("M1","M2","M3","M4","M5","M6","M7","M8","M9","M10","M11","M12")
ann_colors=list(Var1=Var1)

##2. draw heat map
data format
probe sample1 sample2 sample3 sample4
probe1 v v v v
probe2 v v v v
probe3 v v v v
…

pheatmap(data,annotation=annotation,annotation_colors=ann_colors,cluster_col=FALSE,fontsize=5,width=40,height=10,filename="pathway/heatmap.png",annotation_legend=FALSE,drop_levels=FALSE)

##this will set the picture as a wider one then save it in the directory specified by filename.
##If not specified, both column and row will clustered by hierarchical clustering with distance "euclidean".

4. survival analysis
(code from website quick-R: http://www.statmethods.net/advstats/glm.html)
Here using the log rank for comparison.
(Both K-M curve and log rank test are based on the assumption proportional hazard. If the compared two groups are crossed, which will violate this assumption, you cannot use log rank test any more, no matter how small is your p-value. Then you can use the non-parametric method or multivariate regression model to control some covariates.)
1. data format
obs time status group
1 23 0 1
2 34 0 1
3 44 1 2
4 99 1 2
…

status: 0=censored(the event does not happen)
1=happened

2. code

data=read.table()
library(survival)
survobj<-with(data,Surv(time,status))

#draw plot without group
fit0<-survfit(survobj~1,data=data)
plot(fit0,xlab="",ylab="",yscale=100,main="")

#draw plot by group, group1 will be red, group2 will be blue.
fit1<-survfit(survobj~group,data=data)
plot(fit1,xlab="",ylab="",col=c("red","blue"),main="")
legend("topright",title="",c("male","female"),fill=c("red","blue"))

#comparison
survdiff(survobj~group,data=data)

5. Online GISTIC guide

1. go to http://genepattern.broadinstitute.org/gp/, register if you are new. Enter your email and password to login.
2. choose the modules&pipelines---modules---browse modules---choose the software you will use. You can also drag the module into the favorite modules for convince.
3. For GISTIC, two required files: segmentation file and marker file.

For segmentation file: file.seg.txt, six column with tab-delimited file.

Sample Chromosome Start End Num_Probes Segment_Mean
S1 1 1002134 2301421 2 0.04
.
.

For marker file: file.txt, three column with tab-limited file.(without header is Ok)
SNP chromosome position
SNP1 1 1002134
SNP2 1 2301421
.
.

The first column is the SNP name, if could be assigned by yourself. The second column is the chromosome number. The third column is the position.
The marker file usually obtained from the file before segmentation. It should at least include all the start and end position in your file.seg.txt. If marker file is not available, you can create one based on the positions in your seg.txt.

4. Choose the GISTIC2.0 in the modules, and drag these two required files into the designed area, then press "run".
5. After the job completed, you will get a series result for each job. You can also downloaded them as a zip file or choose the one you are interested.

Monday, October 27, 2014

Statistical issues in DNA methylation analysis(cited from nature review genetics)

Some opinions about this issue:

How to apply bioinformatics and biostatistics methods to DNA methylation data is still a tricky question. In my view, none of them is perfect, because the data obtained cannot satisfy all assumptions. But we need to remember that different platforms will produce data with different features. So choose the methods based on your platform is important.

Another thing I want to point out is: can we apply methods developed for gene expression data to our DNA methylation data? Sorry, I do not know. As mentioned in papers, the scale and assumptions are different between DNA methylation data and gene expression data. Probably, if we use the M-value to present the methylation information rather than beta-value (most commonly used), we can get more closer data features to gene expression data. For example, M-value is more normal (even though not normal enough) than beta-value; it will change from finite scale(0 to 1) to infinite scale (although it will still restricted into -6 to +6).

The following paragraphs are cited from "principles and challenges of genome-wide DNA methylation analysis". Peter.W.Larid
full paper link: http://www.nature.com/nrg/journal/v11/n3/full/nrg2732.html

Box2 An introduction to statistical issues in DNA methylation analysis
DNA samples are generally derived from mixtures of cell populations with heterogeneous DNA methylation profiles. Bisulphite sequencing-based approaches can provide a discrete DNA methylation pattern corresponding to a single original DNA molecule. However, most techniques provide an average measurement across the sampled DNA molecules for a particular locus or CpG dinucleotide. For biological and historical reasons, this is usually expressed as a fraction or percentage methylation of the total molecules assessed. For most platforms, DNA methylation measurements represent absolute measurements for a given sample, whereas gene expression measurements are usually expressed as a differential comparison between samples. Usually, strand-specific DNA methylation or hemimethylation is not considered, although monoallelic methylation — including, but not restricted to, imprinted regions — does occur in vivo. The resulting measurement scale is therefore 0 to 1, or 0 to 100%, with 0 indicating that no methylated molecules were identified and 1 or 100% indicating that all identified molecules were methylated. The fraction is calculated as M/(M + U), in which M represents the signal for methylated molecules and U the signal for unmethylated molecules.

It is important to note that this is a finite scale (β distribution) that has different statistical properties to the infinite scale that is commonly used in gene expression array analysis. For example, β-distributed DNA methylation measurements are not normally distributed and the variance of measurements within a finite scale is influenced by the mean of the measurements; the variance of measurements with a mean near to the middle of the range can be much larger than the variance of measurements with a mean close to the limits (0 and 1). Therefore, sorting features (for example, genes or probes) by standard deviation will result in a bias towards features with mean methylation in the middle of the range. Reducing the number of features by selecting probes with high standard deviation or median absolute deviation is a common step in unsupervised analyses of microarray data. Therefore, variance-stabilizing data transformations or selection of probes based on a different metric should be considered. The behaviour of the distance metrics used to compare measurements across samples (such as fold-change, log-ratio or simple subtraction) is different for β-distributed fractions or percentages in DNA methylation measurements than for infinite ratio scales. This should be given careful consideration when selecting the most appropriate metric. Clustering and partitioning methods used to identify subgroups of samples with similar DNA methylation profiles are being developed for β-distributed DNA methylation data115.

An alternative method is to report the ratio of methylated to unmethylated molecules for a particular locus (M/U), usually as a log2(M/U) ratio62, 152. This has gained acceptance with the increased use of microarrays in DNA methylation analysis62, and has the advantage that many of the tools developed for gene expression data can be readily applied. However, several points should be considered. First, many data normalization methods for gene expression data assume that many genes are not expressed and that most genes are not differentially expressed; similar assumptions cannot be made for DNA methylation measurements. More importantly, M and U are generally not independent, which violates an assumption of many of the statistical approaches for gene expression microarrays. M and U are biologically inversely correlated, but many DNA methylation platforms show them to be positively correlated if signal strength is strongly influenced by genomic location or by probe sequence — that is, if M and U are both derived from either a strongly or weakly hybridizing region. Furthermore, the M/U ratio may be inappropriate for situations in which the platform is measuring across multiple CpG dinucleotides. For example, if two CpGs are being measured and each CpG is methylated at 10%, M/U will equal 0.1 if the platform assesses each CpG independently, but will equal (0.1 × 0.1)/(0.9 × 0.9) = 0.01 if the platform only registers methylation when both CpGs are methylated and the absence of methylation when both CpGs are unmethylated. As the number of locally grouped CpGs increases, these distortions become quite pronounced.

An important distinction between DNA methylation measurements and gene expression measurements is that the total amount of CpG methylation can differ substantially among samples. Therefore, normalization methods, such as quantile normalization and LOESS normalization (which is often applied to data represented on MA plots), that assume similar total signal across samples can remove real biological signal. It is important to note that the fluorescent dyes in the Infinium DNA methylation assay do not correspond to DNA methylation states. Therefore, normalization algorithms based on dye channel comparisons cannot be applied to Infinium DNA methylation data without modification.

DNA methylation bioinformatics, biostatistics and computational biology are fertile areas of research that are under rapid development. Table 3 lists several resources for DNA methylation data analysis that are currently available.

Friday, October 10, 2014

introduction of cluster analysis

Some basic questions about cluster analysis.

1. Main approaches:

(1). Partitioning approach

- K-means

-K-medoids (also called partition around medoids)

-Model Based approaches

(2). Hierarchical approach

-agglomerative (from single element to one big group)

-divisive (divide from one big group until single element)

2. What is the difference between partitioning approach and hierarchical approach, disadvantages and advantages?

(1) Partitioning approach: divide n observations or samples into k set. You should first specify how many subgroups you want, that is the k, then minimize the distance between the center of each subgroup and the points belong to that group.

--Advantages: numer of groups is well defined; a clear deterministic assignment of an object to a group; simple algorithms for inference*

--Disadvantages: have to choose the number of groups; sometimes objects do not fit well to any cluster; can converge on locally optimal solutions and often require multiple restarts with random initializations*

(2) Hierarchical approach: no need to specify k, it will joint or divide one sample by one sample, finally form a dendrogram.

--Advantages: there may be small clusters nested inside large ones; no need to specify number groups ahead of time; flexible linkage methods

--Disadvantages: clusters might not be naturally represented by a hierarchical structure; its necessary to 'cut' the dendrogram in order to produce clusters; bottom up clustering can result in poor structure at the top of the tree, early joins cannot be "undone"

3. K-means, K-medoids, Model-based approach

The difference between K-means and K-medoids is that K-means is using the mean value as the center of each cluster, while K-medoids is using the data point as the center of each cluster.

Model-based clustering is based on distribution model. Usually it is not appropriate for non-Gaussian, high dimensional or large dataset. Houseman suggested a recursive-partitioned mixture model in 2008, which is developed for DNA methylation data analysis. The corresponding R-package is "RPMM".

(for more information about the algorithm of model-based clustering analysis: https://www.stat.washington.edu/raftery/Research/PDF/fraley2002.pdf)

(more information about RPMM: http://www.biomedcentral.com/1471-2105/9/365)

4. What matters the clustering analysis?

Partitioning methods: the algorithm to calculate the distance of the matrix; number of groups

Hierarchical methods: the algorithm to calculate the distance of the matrix; the linkage methods

5. Algorithms to calculate the distance of matrix

We treat each sample as a vector. When comparing two samples, it equals to compare two vector's distance. We want to assign the samples with small distance into one cluster, and separate samples with large distance into different clusters. (one method to evaluate the result is Ward' method in hierarchical clustering http://en.wikipedia.org/wiki/Ward's_method).

Ok, so how can we calculate the distance between samples/vectors?

(1). Euclidean

--The most popular method in DNA methylation data clustering (GCIMP identifier paper, Noushmehr, 2010, Cancer Cell).

--measure geometric distance between two vectors, consider both direction and magnitude

--not appropriate when absolute levels of functionally genes are highly different

--If using the squared euclidean distance, it will place progressively greater weight on objects that are further apart.

Assuming two vector p(p1,p2,p3…pi), q(q1,q2,q3…qi), then the distance between p and q is:

(equation and more information: http://en.wikipedia.org/wiki/Euclidean_distance)

(2) Manhattan, also called Minkowski

(equation, picture and more information: http://en.wikipedia.org/wiki/Taxicab_geometry)

(3) 1-pearson correlation coefficient

Pearson Correlation Coefficient is evaluating the tendency of two vectors to increase or decrease together, it ranges from -1 to 1. |Pearson Correlation Coefficient| closer to 1, means better correlation; closer to 0, means worse.

--The assumption for Pearson Correlation Coefficient is the data follow normal distribution (if not satisfy this assumption, could use Spearman correlation coefficient which is using the rank of the data)

--It can only detect linear relationship and is sensitive to outliers.

(more information: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)

(4) 1- Spearman correlation coefficient

--Robust to outliers, but less sensitive

--Small sample size may provide false positive result

--not often use

(more information: http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)

(5) hamming distance

--suitable for categorical data

Summary:

(1) A good picture to show the difference between Euclidean and Manhattan:

If both p and q are two dimensions, lines represent the distance between the two black dots. The red, blue, yellow line are Manhattan approach, the green line is Euclidean.

(2) Both 1-Pearson Correlation Coefficient and 1- Spearman Correlation Coefficient is 1-correlation format. It is proportional to Euclidean distance, but invariant to range of measurement from one sample to the next sample*.

(3) Distance matters! Different algorithm will get different result.

These three plots are using the same data*.

(4) How to choose the right algorithm?

As I mentioned before, the clustering process is similar to compare two vectors. Gene co-expression network is also comparing samples with different gene expression: picking out those genes which have co-expression trend. Just like GSEA(gene set enrichment analysis), the 20% alteration of a gene set with similar function has more meaning than a single gene changed 200%. It is also easier to interpret.

I think the topic about gene co-expression network will help you understand the advantages and disadvantages about those algorithms better. linkage: http://en.wikipedia.org/wiki/Gene_co-expression_network

6. Linkage method

--Single: using the closest point.

--Complete: using the furthest point.

--Centroids: using the average value for each small group.

--Average: using all data points.

Different linkage methods will largely affect the structure of dendrogram.

These dendrograms are using the same dataset*.
Single linkage

complete linkage

centroids linkage

average linkage

7. Determine the number of groups (the k value)

--silhouette coefficient, the higher the better, positive value means the sample stayed within the cluster well, negative value means it may closer the the "neighbor cluster". Usually, we choose the samples with positive value as the "core samples".

--BIC, AIC

--Dr. Roel G.W. Verhaak using the SigClust to evaluate the cluster significance. SigClust is designed for HDLSS data, high-dimensional, low sample size data, eg. gene expression microarray data. Each time, it compares two cluster and provide the p-value for this comparison. The assumption is normal distribution data. (Verhaak, Roel GW, et al. "Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR and NF1." Cancer cell 17.1 (2010): 98-110.)

Computational Biology