Assess a similarity measure of expression data
Introduction
It is commonly accepted that genes with similar expression profiles
are functionally related. However, there are many ways one can
measure the similarity of expression profiles, and it is not clear
a priori what is the most effective one.
This server tests different similarity measures between expression
profiles. It evaluates their effectiveness in detecting functionally
related genes and their correlation with experimentally verified
functional relationships extracted from pathway data, protein-protein
interaction data, sequence data and promoter data. Our method is
described in detail in the following paper
Data sets
In our study we focus on three data sets, listed below. However, the
methodology and the tools are applicable to other data sets, and we
intend to link other data sets at a later date (contact us if you are
interested in testing your measures on other expression data sets).
- Time-series 1998 (Spellman et al. Mol Bio Cell 9:3273-3297, 1998).
The time-series data set is available to download from the
Yeast Cell Cycle Analysis Project webpage at Stanford. To make
sure that you are using the same data set we used, we strongly
recommend that you download our local copy. This copy contains
only the time-series data; the data in columns 1,2,3,4,5,6,25,50,68
(labeled cln3-1, cln3-2, clb, clb2-2, clb2-1, alpha, cdc15,
cdc28, and elu) are omitted. Blank entries (missing values) were
assigned -666 as a placeholder (you can use your own favorite method
to handle missing data).
- Rosetta-2000 (Hughes et al. Cell 102:109-126, 2000).
Our local copy
- Stress time-series 2004
(Shapira et al. Mol Biol Cell 15:5659-5669, 2004).
Our local copy
In addition, to correlate this data set with the other data sets (sequence,
protein-protein interactions, pathways and promoters) we reordered
the rows in this file. Each line corresponds to one gene
out of the 6298 genes in the yeast genome (see
mapping from gene names to gene numbers).
For us to be able to process your file, it is important that you
report the results using the gene numbers 1-6298. Other useful files:
numbers to names,
numbers to genbank (GI)
Note that some genes have no expression profiles (marked with a
vector of -666). Altogether there are 5902 genes with expression profiles
(get the list of these 5902 genes).
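As an illustration of the conventions above, here is a minimal Python sketch that reads profiles and identifies the genes that have at least one measured value. It assumes that each line of the file holds whitespace-separated numeric values and that the line number equals the gene number; the actual layout of the local copies may differ, and the function names are hypothetical.

```python
MISSING = -666.0  # placeholder used for missing values in the local copies


def parse_profiles(lines):
    """Map gene number (1-based line index) to a list of floats.

    Missing entries (-666) become None. Assumes whitespace-separated
    numeric fields, which is an assumption about the file layout.
    """
    profiles = {}
    for gene_number, line in enumerate(lines, start=1):
        values = []
        for field in line.split():
            x = float(field)
            values.append(None if x == MISSING else x)
        profiles[gene_number] = values
    return profiles


def genes_with_profiles(profiles):
    """Return gene numbers whose profile has at least one measured value."""
    return [g for g, v in profiles.items() if any(x is not None for x in v)]
```

For example, `genes_with_profiles(parse_profiles(open(path)))` would list the gene numbers that can take part in pairwise comparisons, where `path` is wherever you saved the downloaded copy.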
How to test your measure
To test your new measure of similarity, you have to download an
expression data set and compare each pair of expression profiles
(clearly, you can ignore missing expression profiles). Generate a
sorted list of pairs, in this format.
Finally, upload this list to the server. (The list doesn't have to be
sorted, although it will expedite processing if it is.) Note that only
the top 20,000 similarities will be considered, so you can upload a file
with only the first 20,000 pairs.
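The procedure can be sketched in Python as follows. Pearson correlation over the jointly measured time points is used here only as a stand-in for your own measure, and the function names and the minimum of three shared points are assumptions made for this example. The sketch returns the top pairs in memory; writing them out in the format the server expects is left to you.

```python
import heapq
from itertools import combinations
from math import sqrt

MISSING = -666.0  # placeholder for missing values
TOP_N = 20000     # only the top 20,000 similarities are considered


def pearson(a, b):
    """Pearson correlation over positions measured in both profiles.

    Returns None if fewer than 3 shared points or if either profile
    is constant over the shared points (illustrative thresholds).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != MISSING and y != MISSING]
    n = len(pairs)
    if n < 3:
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def scored_pairs(profiles, measure):
    """Yield (score, i, j) for every gene pair with a defined score."""
    for i, j in combinations(sorted(profiles), 2):
        score = measure(profiles[i], profiles[j])
        if score is not None:
            yield (score, i, j)


def top_pairs(profiles, measure=pearson, top_n=TOP_N):
    """Return the top_n (score, i, j) tuples, most similar first.

    profiles maps gene number -> list of values (with -666 for missing).
    """
    return heapq.nlargest(top_n, scored_pairs(profiles, measure))
```

Passing a generator to `heapq.nlargest` avoids holding all of the roughly 17 million pairs among 5902 genes in memory at once.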
Results
Our scripts will compare your method to other existing methods and
will generate an ROC curve, as in this
figure.
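The server's own scripts are not reproduced here. As a generic sketch of how an ROC-style curve can be derived from a ranked list, the Python function below walks down the list and counts, at each rank, how many pairs so far belong to a reference set of related pairs and how many do not. The reference set, the function name and the use of raw counts instead of rates are assumptions made for this example.

```python
def roc_points(ranked_pairs, related):
    """Cumulative (false_positives, true_positives) counts along a ranking.

    ranked_pairs: list of (i, j) gene-number pairs, most similar first.
    related: set of frozenset({i, j}) treated as functionally related
             (a hypothetical reference set supplied by the caller).
    """
    tp = fp = 0
    points = [(0, 0)]
    for i, j in ranked_pairs:
        if frozenset((i, j)) in related:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points
```

Plotting the second coordinate against the first gives a curve that rises faster for measures that rank related pairs near the top.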