Computational Biology: Share small R code

I would like to share some small but useful R codes and an online GISTIC tutorial. If code cited from somewhere, I marked them out.

1. Silhouette plot

#read the clusters into R, define the classification by yourself
cluster<-read.table("/Users/Documents/cluster.txt",header=T,row.names=1)

#colmun as features, row as samples
cluster<-t(cluster)

#calculate the dissimilarity and distance.
#The data has features in row, samples in column
library(cluster)
cluster.d<-daisy(t(data),metric="euclidean")

#calculate the silhouette
cluster.s<-silhouette(cluster,cluster.d)

#plot the silhouette
plot(cluster.s)

#write out the scores
write.table(cluster.s,file="/Users/Documents/cluster.s.txt",sep="\t",quote=F)

2. ConsensusClusterPlus-bioconductor R package

#it may take to sometime to install the package.
library(ConsensusClusterPlus)

#because there are many plots coming out, and the latter one will replace the former one (in my mac), so I set this, every time you need to "enter" to produce a new plot
op<-par(ask=TRUE)

#m is the data matrix with columns as samples, rows as features. maxK is the maximum K value, it will calculate from 2 to K. distance and cluster algorithm has more options, please go to tutorial link: http://www.bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
res=ConsensusClusterPlus(m,maxK=10,reps=1000,pItem=0.8,pFeature=1,title="kmeans_cluster",distance="euclidean",clusterAlg="km",seed=123455)

#this will show the class of all samples. [[3]] means when k=3, more option about the result also refer to the link shown above.
res[[3]][["consensusClass"]]

3. R package-pheatmap

##how to use pheatmap--a powerful package drawing heat map

##1.define the annotation of each sample, add color bar to show the predefined clusters
##create a annotation for samples.
##a data frame with the following format
## Var1
##sample1 1
##sample2 1
##sample3 2
##sample4 2
##sample5 3
##sample6 3
##…

##assign the group to each sample. and read this file into R.

annotation=read.table("pathway/annotation.txt",header=T,row.names=1)
Var1=c("maroon3","olivedrab","orange","palevioletred","royalblue","seagreen","slateblue","tomato","steelblue","yellow4","red","orchid")
names(Var1)=c("M1","M2","M3","M4","M5","M6","M7","M8","M9","M10","M11","M12")
ann_colors=list(Var1=Var1)

##2. draw heat map
data format
probe sample1 sample2 sample3 sample4
probe1 v v v v
probe2 v v v v
probe3 v v v v
…

pheatmap(data,annotation=annotation,annotation_colors=ann_colors,cluster_col=FALSE,fontsize=5,width=40,height=10,filename="pathway/heatmap.png",annotation_legend=FALSE,drop_levels=FALSE)

##this will set the picture as a wider one then save it in the directory specified by filename.
##If not specified, both column and row will clustered by hierarchical clustering with distance "euclidean".

4. survival analysis
(code from website quick-R: http://www.statmethods.net/advstats/glm.html)
Here using the log rank for comparison.
(Both K-M curve and log rank test are based on the assumption proportional hazard. If the compared two groups are crossed, which will violate this assumption, you cannot use log rank test any more, no matter how small is your p-value. Then you can use the non-parametric method or multivariate regression model to control some covariates.)
1. data format
obs time status group
1 23 0 1
2 34 0 1
3 44 1 2
4 99 1 2
…

status: 0=censored(the event does not happen)
1=happened

2. code

data=read.table()
library(survival)
survobj<-with(data,Surv(time,status))

#draw plot without group
fit0<-survfit(survobj~1,data=data)
plot(fit0,xlab="",ylab="",yscale=100,main="")

#draw plot by group, group1 will be red, group2 will be blue.
fit1<-survfit(survobj~group,data=data)
plot(fit1,xlab="",ylab="",col=c("red","blue"),main="")
legend("topright",title="",c("male","female"),fill=c("red","blue"))

#comparison
survdiff(survobj~group,data=data)

5. Online GISTIC guide

1. go to http://genepattern.broadinstitute.org/gp/, register if you are new. Enter your email and password to login.
2. choose the modules&pipelines---modules---browse modules---choose the software you will use. You can also drag the module into the favorite modules for convince.
3. For GISTIC, two required files: segmentation file and marker file.

For segmentation file: file.seg.txt, six column with tab-delimited file.

Sample Chromosome Start End Num_Probes Segment_Mean
S1 1 1002134 2301421 2 0.04
.
.

For marker file: file.txt, three column with tab-limited file.(without header is Ok)
SNP chromosome position
SNP1 1 1002134
SNP2 1 2301421
.
.

The first column is the SNP name, if could be assigned by yourself. The second column is the chromosome number. The third column is the position.
The marker file usually obtained from the file before segmentation. It should at least include all the start and end position in your file.seg.txt. If marker file is not available, you can create one based on the positions in your seg.txt.

4. Choose the GISTIC2.0 in the modules, and drag these two required files into the designed area, then press "run".
5. After the job completed, you will get a series result for each job. You can also downloaded them as a zip file or choose the one you are interested.

Computational Biology

Tuesday, October 28, 2014

Share small R code

No comments:

Post a Comment

JSON file I/O and GDC metadata download

Jie Yang

Report Abuse