Some opinions about this issue:
How to apply bioinformatics and biostatistics methods to DNA methylation data is still a tricky question. In my view, no method is perfect, because real data rarely satisfy all of a method's assumptions. We also need to remember that different platforms produce data with different features, so it is important to choose methods based on your platform.
Another thing I want to point out: can we apply methods developed for gene expression data to DNA methylation data? Honestly, I do not know. As the papers mention, the scale and the assumptions differ between DNA methylation data and gene expression data. If we present the methylation information as M-values rather than the more commonly used beta-values, the data probably behave more like gene expression data. For example, M-values are closer to normal than beta-values (though still not truly normal), and the scale changes from finite (0 to 1) to infinite, although in practice the values still mostly fall between about -6 and +6.
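To make the beta-value versus M-value comparison concrete, here is a minimal sketch of the conversion, assuming the usual logit definition M = log2(beta / (1 - beta)). The function names and the small clipping constant `eps` are my own choices for illustration, not taken from any particular package.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value (0 to 1) to an M-value, log2(beta / (1 - beta)).

    eps is an assumed clipping constant that keeps beta away from exactly
    0 or 1, where the M-value would be infinite.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform: map an M-value back to the 0-to-1 beta scale."""
    return 2.0 ** m / (2.0 ** m + 1.0)

# Equal steps in beta near the limits become large steps in M.
for beta in (0.02, 0.10, 0.50, 0.90, 0.98):
    print(f"beta = {beta:.2f}  ->  M = {beta_to_m(beta):+.2f}")
```

Under this definition, an M-value of +6 corresponds to a beta-value of 64/65 (about 0.985) and -6 to about 0.015, which is why M-values outside roughly -6 to +6 are rarely seen even though the scale is unbounded in principle.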
The following paragraphs are quoted from "Principles and challenges of genome-wide DNA methylation analysis" by Peter W. Laird.
full paper link: http://www.nature.com/nrg/journal/v11/n3/full/nrg2732.html
Box 2: An introduction to statistical issues in DNA methylation analysis
DNA samples are generally derived from mixtures of cell populations with heterogeneous DNA methylation profiles. Bisulphite sequencing-based approaches can provide a discrete DNA methylation pattern corresponding to a single original DNA molecule. However, most techniques provide an average measurement across the sampled DNA molecules for a particular locus or CpG dinucleotide. For biological and historical reasons, this is usually expressed as a fraction or percentage methylation of the total molecules assessed. For most platforms, DNA methylation measurements represent absolute measurements for a given sample, whereas gene expression measurements are usually expressed as a differential comparison between samples. Usually, strand-specific DNA methylation or hemimethylation is not considered, although monoallelic methylation — including, but not restricted to, imprinted regions — does occur in vivo. The resulting measurement scale is therefore 0 to 1, or 0 to 100%, with 0 indicating that no methylated molecules were identified and 1 or 100% indicating that all identified molecules were methylated. The fraction is calculated as M/(M + U), in which M represents the signal for methylated molecules and U the signal for unmethylated molecules.
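The fraction M/(M + U) described above can be sketched in a few lines. The function name and the toy mixture of molecules are hypothetical, chosen only to show how an averaged measurement arises from a heterogeneous sample.

```python
def methylation_fraction(m_signal, u_signal):
    """Return M / (M + U), the methylated fraction on the 0-to-1 scale."""
    total = m_signal + u_signal
    if total <= 0:
        raise ValueError("no signal: M + U must be positive")
    return m_signal / total

# Hypothetical mixture at a single CpG: each entry is one sampled DNA
# molecule, 1 = methylated, 0 = unmethylated.
molecules = [1] * 70 + [0] * 30
m = sum(molecules)            # methylated molecules
u = len(molecules) - m        # unmethylated molecules
print(methylation_fraction(m, u))  # 0.7
```

Each molecule here is either methylated or not, but the reported value is the average across all sampled molecules, which is what most platforms give you.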
It is important to note that this is a finite scale (β distribution) that has different statistical properties to the infinite scale that is commonly used in gene expression array analysis. For example, β-distributed DNA methylation measurements are not normally distributed and the variance of measurements within a finite scale is influenced by the mean of the measurements; the variance of measurements with a mean near to the middle of the range can be much larger than the variance of measurements with a mean close to the limits (0 and 1). Therefore, sorting features (for example, genes or probes) by standard deviation will result in a bias towards features with mean methylation in the middle of the range. Reducing the number of features by selecting probes with high standard deviation or median absolute deviation is a common step in unsupervised analyses of microarray data. Therefore, variance-stabilizing data transformations or selection of probes based on a different metric should be considered. The behaviour of the distance metrics used to compare measurements across samples (such as fold-change, log-ratio or simple subtraction) is different for β-distributed fractions or percentages in DNA methylation measurements than for infinite ratio scales. This should be given careful consideration when selecting the most appropriate metric. Clustering and partitioning methods used to identify subgroups of samples with similar DNA methylation profiles are being developed for β-distributed DNA methylation data [115].
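The dependence of the variance on the mean can be illustrated with a small simulation. This is only a sketch under a simplifying assumption of my own: each measurement is the methylated fraction among `n_molecules` independently sampled molecules (binomial sampling), which gives a standard deviation of sqrt(p(1 - p)/n). Real array data have other noise sources, so treat the numbers as qualitative.

```python
import random
import statistics

def simulate_fraction_sd(mean_methylation, n_molecules=50, n_samples=2000, seed=0):
    """Standard deviation of simulated methylation fractions.

    Assumption (not from the paper): each of n_molecules is methylated
    independently with probability mean_methylation.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_samples):
        methylated = sum(rng.random() < mean_methylation for _ in range(n_molecules))
        fractions.append(methylated / n_molecules)
    return statistics.pstdev(fractions)

for mu in (0.05, 0.25, 0.50, 0.75, 0.95):
    expected = (mu * (1 - mu) / 50) ** 0.5   # binomial SD of a fraction
    print(f"mean = {mu:.2f}  simulated SD = {simulate_fraction_sd(mu):.3f}  "
          f"expected SD = {expected:.3f}")
```

In this toy model, a probe with mean methylation near 0.5 has a much larger standard deviation than one near 0.05 or 0.95 even though nothing biological differs between them, so ranking probes by standard deviation would favour the mid-range ones.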
An alternative method is to report the ratio of methylated to unmethylated molecules for a particular locus (M/U), usually as a log2(M/U) ratio [62, 152]. This has gained acceptance with the increased use of microarrays in DNA methylation analysis [62], and has the advantage that many of the tools developed for gene expression data can be readily applied. However, several points should be considered. First, many data normalization methods for gene expression data assume that many genes are not expressed and that most genes are not differentially expressed; similar assumptions cannot be made for DNA methylation measurements. More importantly, M and U are generally not independent, which violates an assumption of many of the statistical approaches for gene expression microarrays. M and U are biologically inversely correlated, but many DNA methylation platforms show them to be positively correlated if signal strength is strongly influenced by genomic location or by probe sequence (that is, if M and U are both derived from either a strongly or weakly hybridizing region). Furthermore, the M/U ratio may be inappropriate for situations in which the platform is measuring across multiple CpG dinucleotides. For example, if two CpGs are being measured and each CpG is methylated at 10%, M/U will equal 0.1 if the platform assesses each CpG independently, but will equal (0.1 × 0.1)/(0.9 × 0.9) = 0.01 if the platform only registers methylation when both CpGs are methylated and the absence of methylation when both CpGs are unmethylated. As the number of locally grouped CpGs increases, these distortions become quite pronounced.
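The two-CpG example can be worked through numerically. The sketch below assumes, as the example implicitly does, that the CpGs are methylated independently of one another; the function names are hypothetical. Computed exactly, the single-CpG ratio is 0.1/0.9 ≈ 0.111 and the two-CpG ratio is 0.01/0.81 ≈ 0.0123, so the 0.1 and 0.01 in the quoted text look like rounded values.

```python
def ratio_single_cpg(p):
    """M/U when each CpG is assessed on its own: p / (1 - p)."""
    return p / (1.0 - p)

def ratio_all_or_none(p, n_cpgs):
    """M/U when the platform only counts molecules whose n_cpgs sites are
    all methylated (M) or all unmethylated (U).

    Assumes each CpG is methylated independently with probability p.
    """
    return (p ** n_cpgs) / ((1.0 - p) ** n_cpgs)

p = 0.10
print(f"per-CpG ratio: {ratio_single_cpg(p):.4f}")
for n in range(1, 6):
    print(f"{n} grouped CpGs -> M/U = {ratio_all_or_none(p, n):.6f}")
```

Under this assumption the all-or-none ratio is just the per-CpG ratio raised to the power of the number of CpGs, so it shrinks (or, for methylation above 50%, grows) geometrically as more CpGs are grouped.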
An important distinction between DNA methylation measurements and gene expression measurements is that the total amount of CpG methylation can differ substantially among samples. Therefore, normalization methods, such as quantile normalization and LOESS normalization (which is often applied to data represented on MA plots), that assume similar total signal across samples can remove real biological signal. It is important to note that the fluorescent dyes in the Infinium DNA methylation assay do not correspond to DNA methylation states. Therefore, normalization algorithms based on dye channel comparisons cannot be applied to Infinium DNA methylation data without modification.
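To see how quantile normalization can erase a real global difference, here is a bare-bones sketch written from scratch (it is not the implementation from any particular package, and it ignores ties). The two samples are made-up beta-values in which the second sample is globally hypomethylated.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length samples (lists of values).

    Every sample is forced onto the same distribution: the value at each
    rank is replaced by the mean of the values at that rank across samples.
    Ties are not handled specially in this sketch.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # feature indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical beta-values for five probes in two samples.
typical = [0.10, 0.80, 0.85, 0.90, 0.20]
hypomethylated = [0.05, 0.40, 0.45, 0.50, 0.10]

for name, values in (("typical", typical), ("hypomethylated", hypomethylated)):
    print(f"before: {name:15s} mean = {sum(values) / len(values):.3f}")
for name, values in zip(("typical", "hypomethylated"),
                        quantile_normalize([typical, hypomethylated])):
    print(f"after:  {name:15s} mean = {sum(values) / len(values):.3f}")
```

After normalization both samples have the same set of values and therefore the same mean, so the global hypomethylation of the second sample is no longer visible in the data.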
DNA methylation bioinformatics, biostatistics and computational biology are fertile areas of research that are under rapid development. Table 3 lists several resources for DNA methylation data analysis that are currently available.