We do research in bioinformatics, applying computational approaches to problems in molecular biology. Broadly, we are interested in large-scale analyses of genome sequences, macromolecular structures, and functional-genomics datasets. We hope these analyses will let us address a number of overall statistical questions about macromolecules, relating to their physical properties, cellular function, interactions, and phylogenetic distribution. We are especially focused on the human genome and proteome. Our research involves a number of quantitative techniques, including database design, systematic data mining and machine learning, visualization of high-dimensional data, and molecular simulation. More specifically, we focus on three questions. First, we are interested in annotating the raw human genome sequence, especially in characterizing the vast intergenic regions and one of their most important elements, pseudogenes. Next, we are trying to determine the function of all the protein elements encoded by the genome. Here, we try to characterize function on a large scale through the use of molecular networks. Finally, for the population of proteins that have known 3D structures, we are trying to see how their function is carried out through motion and how motion can be predicted from packing geometry.
Extensive Research Description
The biological sciences are being transformed by the advent of large-scale data. The sequencing of the human genome is perhaps the most dramatic example of this. Alongside this increase in biological data, computers and computation have transformed the way information is handled, stored, and mined. These computational advances, of course, apply to many facets of life. The goal of my lab is to connect these two developments, harnessing computational advances for the analysis of large-scale data, principally by carrying out integrative surveys, systematic data mining, and molecular simulation.
Specifically, we are focused on protein bioinformatics: understanding the structure, function, and evolution of proteins through analyzing populations of them in the databases and in whole-genome experiments. Overall we have four research foci, summarized below.
1. Genomics: Mining Intergenic Regions, especially in relation to Pseudogenes
We are involved in a number of large-scale collaborations to probe the activity of intergenic regions with tiling-array technology. The overall conclusion from this work has been that many of the intergenic regions of the human genome appear to be active, both transcriptionally and in terms of protein binding. In connection with tiling-array experiments, we have done extensive intergenic annotation, with a particular focus on mining intergenic regions for pseudogenes (protein fossils). Collectively, our studies enable us to determine the common "pseudofamilies" in various genomes and address important evolutionary questions about the proteins that were present in the history of an organism.
2. Proteomics: Using Networks to Understand Protein Function
After the main elements of the human genome are identified, one needs to characterize their function. We are trying to characterize gene function through molecular networks. We work on systematically integrating many weak functional-genomic features with data mining techniques to predict protein networks (comprising protein interactions and other functional linkages). In addition, we have studied the structure of protein networks, both on a large scale in terms of global statistics (e.g., the diameter) and on a small scale in terms of local network motifs (e.g., hubs).
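As a concrete illustration of the two kinds of network statistics mentioned above, the following minimal sketch computes a global measure (the diameter, i.e. the longest shortest path) and a local one (hubs, taken here simply as nodes whose degree meets a threshold). The toy edge list, the function names, and the degree cutoff are all assumptions made for illustration; they are not taken from the lab's actual datasets or software.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def eccentricity(adj, source):
    """Longest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def diameter(adj):
    """Global statistic: the largest eccentricity (assumes a connected network)."""
    return max(eccentricity(adj, node) for node in adj)


def hubs(adj, min_degree):
    """Local statistic: nodes with at least min_degree neighbors (hypothetical cutoff)."""
    return sorted(node for node, nbrs in adj.items() if len(nbrs) >= min_degree)


# Toy interaction network with made-up protein labels.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
adj = build_adjacency(edges)
print(diameter(adj))   # 3, e.g. the path B-A-D-E
print(hubs(adj, 3))    # ['A']
```

Running a breadth-first search from every node is adequate for small examples; for genome-scale networks a dedicated graph library would be the more practical choice.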
3. Structural Genomics: Analysis of Folds, Families and Functions on a Large Scale
Another area of research in our lab is structural genomics. Here, we conceptualize proteins not purely as character sequences or abstract network nodes, but more in terms of their molecular structure. We have examined the large-scale relationships between sequence, structure and function in order to understand the extent to which structural and functional annotation can reliably be transferred between similar sequences, particularly when similarity is expressed in modern probabilistic language. We have also related the occurrence of protein folds and families to phylogeny and deep evolutionary history.
4. Computational Biophysics: Relating Motions and Packing
The final area of focus in the lab is analyzing small populations of structures in terms of their detailed 3D geometry and physical properties. Here, we try to interpret macromolecular motions in terms of packing. We have set up a database of macromolecular motions and coupled it with simulation tools to interpolate between structural conformations; the database also has tools to predict likely motions based on simple models, such as normal modes and localized hinges connecting rigid domains.
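To make the idea of interpolating between structural conformations concrete, here is a minimal sketch of the simplest possible scheme: straight-line (linear) interpolation of atomic coordinates between a start and an end conformation. This is an assumption chosen for illustration, not a description of the database's actual simulation tools; the function name and the two-atom toy coordinates are hypothetical.

```python
import math


def interpolate_conformations(start, end, n_frames):
    """Return n_frames coordinate sets linearly spaced from start to end, inclusive.

    start and end are equal-length lists of (x, y, z) tuples, one per atom,
    with atoms listed in the same order in both conformations.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (the two endpoints)")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # fraction of the way from start to end
        frame = [
            tuple(s + t * (e - s) for s, e in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Toy example: the second atom swings 90 degrees about the first.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frames = interpolate_conformations(start, end, 3)

print(frames[1][1])                            # (0.5, 0.5, 0.0)
print(math.dist(frames[1][0], frames[1][1]))   # about 0.707, shorter than the original 1.0
```

The midpoint frame shows the main limitation of this naive approach: the distance between the two atoms shrinks from 1.0 to about 0.707, so intermediate structures can be physically unrealistic. More realistic interpolation schemes generally add a refinement step to restore sensible geometry.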
- Yip KY, Kim PM, McDermott D, Gerstein M. BMC Bioinformatics. 2009 Aug 5;10(1):241. [Epub ahead of print]
- Gerstein, M. and Zheng, D. (2006). The real life of pseudogenes. Sci. Am. 295: 48-55.
- Kim, P.M., Lu, L.J., Xia, Y., and Gerstein, M.B. (2006). Relating three-dimensional structures to protein networks provides evolutionary insights. Science 314:1938-41.
- H. Yu, M. Gerstein (2006). Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci U S A 103: 14724-31.
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein M (2009). Peak