Module Networks: Discovering Regulatory Modules and their
Condition Specific Regulators from Gene Expression Data
|
E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, N. Friedman
Nature Genetics, 2003 June, 34(2): 166-76
Much of a cell's activity is organized as a network of interacting modules: sets of genes co-regulated to respond to different conditions. We present a probabilistic method for discovering regulatory modules from gene expression data. Our procedure identifies modules of co-regulated genes, their regulators, and the conditions under which regulation occurs, generating testable hypotheses in the form "regulator 'X' regulates process 'Y' under conditions 'W'". We applied the method to a Saccharomyces cerevisiae expression dataset, demonstrating its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.
|
Full Paper
Suppl.
|
GeneXPress: a Tool for Statistical Analysis and Visualization of Genomic Data
|
E. Segal, R. Yelensky, A. Kaushal, T. Pham, A. Regev, N. Friedman, D. Koller
|
Download
|
Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression
|
E. Segal, R. Yelensky, D. Koller
Bioinformatics, 2003; 19 Suppl 1:
Recipient of the Best Paper Award at the 11th International Conference on Intelligent Systems for Molecular Biology (ISMB)
In this paper, we describe an approach
for understanding transcriptional regulation from both gene expression
and promoter sequence data. We aim to identify transcriptional
modules: sets of genes that are co-regulated in a set of
experiments through a common motif profile. Using the EM
algorithm, our approach refines both the module assignment and the
motif profile so as to best explain the expression data as a function
of transcriptional motifs. It also dynamically adds and deletes
motifs, as required to provide a genome-wide explanation of the
expression data. We evaluate the method on two S. cerevisiae
gene expression data sets, showing that our approach is better than a
standard one at recovering known motifs and at generating biologically
coherent modules. We also combine our results with binding
localization data to obtain regulatory relationships
with known transcription factors, and show that many of the inferred
relationships have support in the literature.
|
Full Paper
|
Discovering Molecular Pathways from Protein Interaction and Gene Expression Data
|
E. Segal, H. Wang, D. Koller
Bioinformatics, 2003; 19 Suppl 1:
Recipient of the Best Student Paper Award at the 11th International Conference on Intelligent Systems for Molecular Biology (ISMB)
In this paper, we describe an approach for identifying "pathways"
from gene expression and protein interaction data. Our approach
assumes that many pathways exhibit two properties:
their genes exhibit a similar gene expression profile, and the protein
products of the genes often interact. We capture both properties in a
unified probabilistic model, which is learned from the data using the
EM algorithm. We present results on two S. cerevisiae gene
expression data sets, combined with a binary protein interaction data
set. Our results show that our approach is much more successful than
other approaches at discovering both coherent functional groups and
entire protein complexes.
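The two properties assumed above can be illustrated with a toy check. The sketch below is not the paper's probabilistic model or its EM procedure; it only measures, for a candidate gene group, the mean pairwise expression correlation and the fraction of gene pairs with a reported interaction. The function names, gene labels, and data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def group_support(genes, profiles, interactions):
    """Return (mean pairwise correlation, fraction of interacting pairs).

    genes: collection of at least two gene names (hypothetical labels).
    profiles: dict mapping gene name -> list of expression values.
    interactions: set of frozenset({gene_a, gene_b}) binary interactions.
    """
    pairs = list(combinations(sorted(genes), 2))
    mean_corr = sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
    frac_interacting = sum(frozenset(p) in interactions for p in pairs) / len(pairs)
    return mean_corr, frac_interacting


# Made-up toy data: g1 and g2 are co-expressed and interact; g3 does neither.
profiles = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0], "g3": [3.0, 2.0, 1.0]}
interactions = {frozenset({"g1", "g2"})}

coherent = group_support({"g1", "g2"}, profiles, interactions)
incoherent = group_support({"g1", "g3"}, profiles, interactions)
```

A group that looks like a pathway under these assumptions scores high on both measures, as `coherent` does here, while `incoherent` scores low on both.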
|
Full Paper
|
Decomposing Gene Expression into Cellular Processes
|
E. Segal, A. Battle, D. Koller
In Proceedings of the 8th Pacific Symposium on Biocomputing (PSB), Kaua'i, January 2003
We propose a probabilistic model for cellular processes, and an
algorithm for discovering them from gene expression data. A
process is associated with a set of genes that participate in it;
unlike clustering techniques, our model allows genes to participate in
multiple processes. Each process may be active to a different degree in
each experiment. The expression measurement for gene g in array a is
a sum, over all processes in which g participates, of the activity
levels of these processes in array a. We describe an iterative
procedure, based on the EM algorithm, for decomposing the expression
matrix into a given number of processes. We present results on
yeast gene expression data, which indicate that our approach identifies
real biological processes.
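As an illustration of the additive model described above, the following sketch computes the expected expression of each gene in each array as the sum of the activity levels of the processes the gene participates in. It covers only this forward computation, not the EM-based decomposition, and it assumes a plain 0/1 membership matrix; the function name and the numbers are hypothetical.

```python
def expected_expression(membership, activity):
    """Expected expression under an additive process model.

    membership[g][p] is 1 if gene g participates in process p, else 0.
    activity[p][a] is the activity level of process p in array a.
    Returns a genes x arrays matrix (list of lists).
    """
    n_arrays = len(activity[0])
    return [
        [
            sum(m * process_row[a] for m, process_row in zip(gene_row, activity))
            for a in range(n_arrays)
        ]
        for gene_row in membership
    ]


# Made-up example: 3 genes, 2 processes, 2 arrays.
# Gene 1 participates in both processes, which hard clustering cannot express.
membership = [
    [1, 0],
    [1, 1],
    [0, 1],
]
activity = [
    [2.0, -1.0],  # process 0 activity in arrays 0 and 1
    [0.5, 3.0],   # process 1 activity in arrays 0 and 1
]

expression = expected_expression(membership, activity)
```

Decomposition runs in the opposite direction: given only an observed expression matrix and a number of processes, it estimates membership and activity values that reproduce the matrix as closely as possible.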
|
Full Paper
|
From Promoter Sequence to Expression: a Probabilistic Framework
|
E. Segal, Y. Barash, I. Simon, N. Friedman, D. Koller
In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB),
Washington, DC, April 2002
We present a probabilistic framework that models the process by which
transcriptional binding explains the mRNA expression of different
genes. Our joint probabilistic model unifies the two key components of this
process: the prediction of gene regulation events from sequence motifs in
the gene's promoter region, and the prediction of mRNA expression from
combinations of gene regulation events in different settings. Our
approach has several advantages. By learning promoter sequence motifs
that are directly predictive of expression data, it can improve the
identification of binding site patterns. It is also able to identify
combinatorial regulation via interactions of different transcription
factors. Finally, the general framework allows us to integrate
additional data sources, including data from the recent binding
localization assays. We demonstrate our approach on the cell cycle
data of Spellman et al., combined with the binding localization
information of Simon et al. We show that the learned model
predicts expression from sequence, and that it identifies coherent
co-regulated groups with significant transcription factor motifs. It
also provides valuable biological insight into the domain via these
co-regulated "modules" and the combinatorial regulation effects that
govern their behavior.
|
Full Paper
|
Probabilistic Hierarchical Clustering for Biological Data
|
E. Segal, D. Koller
In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB),
Washington, DC, April 2002
Biological data, such as gene expression profiles or protein sequences, is
often organized in a hierarchy of classes, where the instances assigned to
"nearby" classes in the tree are similar. Most approaches for
constructing a hierarchy use simple local operations that are very
sensitive to noise or variation in the data. In this paper, we describe
probabilistic abstraction hierarchies (PAH), a
general probabilistic framework for clustering data into a hierarchy, and
show how it can be applied to a wide variety of biological data sets. In a
PAH, each class is associated with a probabilistic generative model for the
data in the class. The PAH clustering algorithm simultaneously optimizes
three things: the assignment of data instances to clusters, the models
associated with the clusters, and the structure of the abstraction
hierarchy. A unique feature of the PAH approach is that it utilizes global
optimization algorithms for the last two steps, substantially reducing the
sensitivity to noise and the propensity to local maxima. We show how to
apply this framework to gene expression data, protein sequence data, and HIV
protease sequence data. We also show how our framework supports hierarchies
involving more than one type of data. We demonstrate that our method
extracts useful biological knowledge and is substantially more robust than
hierarchical agglomerative clustering.
|
Full Paper
|
Rich Probabilistic Models for Gene Expression
|
E. Segal, B. Taskar, A. Gasch, N. Friedman, D. Koller
Bioinformatics, 2001; 17 Suppl 1:S243-252
Clustering is commonly used for analyzing gene expression data.
Despite their successes, clustering methods suffer from a number of
limitations. First, these methods reveal similarities that exist over
all of the measurements, while obscuring relationships that exist over
only a subset of the data. Second, clustering methods cannot readily
incorporate additional types of information, such as clinical data or
known attributes of genes.
To circumvent these shortcomings, we propose the use of a single coherent
probabilistic model that encompasses much of the rich structure in
the genomic expression data, while incorporating additional
information such as experiment type, putative binding
sites, or functional information.
We show how this model can be learned from the data, allowing us to discover
patterns in the data and
dependencies between the gene expression patterns and additional
attributes.
The learned model reveals context-specific relationships that exist
only over a subset of the experiments in the dataset.
We demonstrate the power of our approach on
synthetic data and on two real-world gene expression data sets for yeast.
For example, we demonstrate a novel functionality that falls
naturally out of our framework: predicting the "cluster" of the array
resulting from a gene mutation based only on the gene's expression
pattern in the context of other mutations.
|
Full Paper
|