Learning Models of Genomic Data

Ph.D. student: Eran Segal
Advisor: Daphne Koller
Other students:
Alexis Battle
Amit Kaushal
Richard May
Tuan Pham
Romain Thibaux
Haidong Wang
Roman Yelensky
Yoseph Barash
Asa Ben-Hur
David Botstein
Doug Brutlag
Nir Friedman
Audrey Gasch
Stuart Kim
Aviv Regev
Matt Scott
Itamar Simon
Michael Shapira
Joshua Stuart
Roman Yelensky
Dana Pe'er
Module Networks: Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data
E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, N. Friedman
Nature Genetics, 2003 June, 34(2): 166-76

Much of a cell's activity is organized as a network of interacting modules: sets of genes co-regulated to respond to different conditions. We present a probabilistic method for discovering regulatory modules from gene expression data. Our procedure identifies modules of co-regulated genes, their regulators, and the conditions under which regulation occurs, generating testable hypotheses in the form "regulator 'X' regulates process 'Y' under conditions 'W'". We applied the method to a Saccharomyces cerevisiae expression dataset, demonstrating its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.
Full Paper Suppl.

GeneXPress: a Tool for Statistical Analysis and Visualization of Genomic Data
E. Segal, R. Yelensky, A. Kaushal, T. Pham, A. Regev, N. Friedman, D. Koller

Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression
E. Segal, R. Yelensky, D. Koller
Bioinformatics, 2003; 19 Suppl 1:
Recipient of the Best Paper Award at the 11th Inter. Conf. on Intelligent Systems for Molecular Biology (ISMB)

In this paper, we describe an approach for understanding transcriptional regulation from both gene expression and promoter sequence data. We aim to identify transcriptional modules: sets of genes that are co-regulated in a set of experiments through a common motif profile. Using the EM algorithm, our approach refines both the module assignment and the motif profile so as to best explain the expression data as a function of transcriptional motifs. It also dynamically adds and deletes motifs, as required to provide a genome-wide explanation of the expression data. We evaluate the method on two S. cerevisiae gene expression data sets, showing that our approach is better than a standard one at recovering known motifs and at generating biologically coherent modules. We also combine our results with binding localization data to obtain regulatory relationships with known transcription factors, and show that many of the inferred relationships have support in the literature.
Full Paper

Discovering Molecular Pathways from Protein Interaction and Gene Expression Data
E. Segal, H. Wang, D. Koller
Bioinformatics, 2003; 19 Suppl 1:
Recipient of the Best Student Paper Award at the 11th Inter. Conf. on Intelligent Systems for Molecular Biology (ISMB)

In this paper, we describe an approach for identifying "pathways" from gene expression and protein interaction data. Our approach rests on the assumption that many pathways exhibit two properties: their genes exhibit a similar gene expression profile, and the protein products of the genes often interact. It uses a unified probabilistic model, which is learned from the data using the EM algorithm. We present results on two S. cerevisiae gene expression data sets, combined with a binary protein interaction data set. Our results show that our approach is much more successful than other approaches at discovering both coherent functional groups and entire protein complexes.
Full Paper
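
The two properties named in the abstract above (similar expression profiles and interacting protein products) can be illustrated with a small sketch. This is not the paper's unified probabilistic model and involves no EM. It only computes, for a candidate gene set, the mean pairwise Pearson correlation of the expression profiles and the fraction of gene pairs with a recorded interaction. The function names, gene names, and toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def pathway_evidence(genes, profiles, interactions):
    """Return (mean pairwise correlation, fraction of interacting pairs).

    genes        -- candidate gene set (hypothetical pathway)
    profiles     -- dict mapping gene name to its expression profile
    interactions -- set of frozensets, one per interacting gene pair
    """
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0, 0.0
    mean_corr = sum(pearson(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)
    interacting = sum(1 for g, h in pairs if frozenset((g, h)) in interactions)
    return mean_corr, interacting / len(pairs)


# Hypothetical toy data: three genes measured in three experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [3.0, 2.0, 1.0],
}
interactions = {frozenset(("geneA", "geneB"))}

coherent = pathway_evidence({"geneA", "geneB"}, profiles, interactions)
incoherent = pathway_evidence({"geneA", "geneC"}, profiles, interactions)
```

In this toy data, geneA and geneB are both co-expressed and interacting, so that pair scores high on both measures, while geneA and geneC score low on both. A probabilistic model such as the one the abstract describes would weigh the two sources of evidence jointly rather than report them separately.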

Decomposing Gene Expression into Cellular Processes
E. Segal, A. Battle, D. Koller
In Proceedings of the 8th Pacific Symposium on Biocomputing (PSB), Kaua'i, January 2003

We propose a probabilistic model for cellular processes, and an algorithm for discovering them from gene expression data. A process is associated with a set of genes that participate in it; unlike clustering techniques, our model allows genes to participate in multiple processes. Each process may be active to a different degree in each experiment. The expression measurement for gene g in array a is a sum, over all processes in which g participates, of the activity levels of these processes in array a. We describe an iterative procedure, based on the EM algorithm, for decomposing the expression matrix into a given number of processes. We present results on yeast gene expression data, which indicate that our approach identifies real biological processes.
Full Paper
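
The additive relationship stated in the abstract above (the expression of gene g in array a is the sum of the activity levels, in array a, of the processes in which g participates) can be written out as a minimal sketch. It covers only the forward direction of the model, from memberships and activities to expected expression, and not the EM-based decomposition. The function name, the 0/1 membership encoding, and the toy numbers are assumptions made for illustration.

```python
def reconstruct_expression(membership, activity):
    """Return expected expression E[g][a] under the additive model.

    membership[g][p] is 1 if gene g participates in process p, else 0.
    activity[p][a] is the activity level of process p in array a.
    E[g][a] = sum over p of membership[g][p] * activity[p][a].
    """
    n_arrays = len(activity[0]) if activity else 0
    expression = []
    for gene_row in membership:
        row = []
        for a in range(n_arrays):
            total = sum(m * activity[p][a] for p, m in enumerate(gene_row))
            row.append(total)
        expression.append(row)
    return expression


# Hypothetical toy data: 3 genes, 2 processes, 2 arrays.
membership = [
    [1, 0],  # gene 0 participates only in process 0
    [0, 1],  # gene 1 participates only in process 1
    [1, 1],  # gene 2 participates in both processes
]
activity = [
    [2.0, -1.0],  # activity of process 0 in arrays 0 and 1
    [0.5, 3.0],   # activity of process 1 in arrays 0 and 1
]
expression = reconstruct_expression(membership, activity)
```

Gene 2 belongs to both processes, so its expected expression in each array is the sum of the two activity levels. A clustering method that assigns each gene to exactly one group cannot represent this case.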

From Promoter Sequence to Expression: a Probabilistic Framework
E. Segal, Y. Barash, I. Simon, N. Friedman, D. Koller
In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB), Washington, DC, April 2002

We present a probabilistic framework that models the process by which transcriptional binding explains the mRNA expression of different genes. Our joint probabilistic model unifies the two key components of this process: the prediction of gene regulation events from sequence motifs in the gene's promoter region, and the prediction of mRNA expression from combinations of gene regulation events in different settings. Our approach has several advantages. By learning promoter sequence motifs that are directly predictive of expression data, it can improve the identification of binding site patterns. It is also able to identify combinatorial regulation via interactions of different transcription factors. Finally, the general framework allows us to integrate additional data sources, including data from the recent binding localization assays. We demonstrate our approach on the cell cycle data of Spellman et al., combined with the binding localization information of Simon et al. We show that the learned model predicts expression from sequence, and that it identifies coherent co-regulated groups with significant transcription factor motifs. It also provides valuable biological insight into the domain via these co-regulated "modules" and the combinatorial regulation effects that govern their behavior.
Full Paper

Probabilistic Hierarchical Clustering for Biological Data
E. Segal, D. Koller
In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB), Washington, DC, April 2002

Biological data, such as gene expression profiles or protein sequences, is often organized in a hierarchy of classes, where the instances assigned to "nearby" classes in the tree are similar. Most approaches for constructing a hierarchy use simple local operations that are very sensitive to noise or variation in the data. In this paper, we describe probabilistic abstraction hierarchies (PAH), a general probabilistic framework for clustering data into a hierarchy, and show how it can be applied to a wide variety of biological data sets. In a PAH, each class is associated with a probabilistic generative model for the data in the class. The PAH clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of the PAH approach is that it utilizes global optimization algorithms for the last two steps, substantially reducing the sensitivity to noise and the propensity to local maxima. We show how to apply this framework to gene expression data, protein sequence data, and HIV protease sequence data. We also show how our framework supports hierarchies involving more than one type of data. We demonstrate that our method extracts useful biological knowledge and is substantially more robust than hierarchical agglomerative clustering.
Full Paper

Rich Probabilistic Models for Gene Expression
E. Segal, B. Taskar, A. Gasch, N. Friedman, D. Koller
Bioinformatics, 2001; 17 Suppl 1:S243-252

Clustering is commonly used for analyzing gene expression data. Despite their successes, clustering methods suffer from a number of limitations. First, these methods reveal similarities that exist over all of the measurements, while obscuring relationships that exist over only a subset of the data. Second, clustering methods cannot readily incorporate additional types of information, such as clinical data or known attributes of genes. To circumvent these shortcomings, we propose the use of a single coherent probabilistic model that encompasses much of the rich structure in the genomic expression data, while incorporating additional information such as experiment type, putative binding sites, or functional information. We show how this model can be learned from the data, allowing us to discover patterns in the data and dependencies between the gene expression patterns and additional attributes. The learned model reveals context-specific relationships that exist only over a subset of the experiments in the dataset. We demonstrate the power of our approach on synthetic data and on two real-world gene expression data sets for yeast. For example, we demonstrate a novel functionality that falls naturally out of our framework: predicting the "cluster" of the array resulting from a gene mutation based only on the gene's expression pattern in the context of other mutations.
Full Paper