Our long-term research goal is to study evolutionary/comparative genomics, population genetics, and computational biology. My lab develops research projects that combine statistical/computational method development, software engineering, large-scale multi-layer data analysis, and experimental work.


Statistical framework for gene family evolution
  We are developing statistical approaches to better understand functional innovation, specialization, and divergence during the course of gene family evolution. The random-field model of evolutionary rate in terms of lineage (subtree) and site (region) provides a general statistical framework that includes functional divergence (Mol Biol Evol. 2001. 18:453; J. Comp Biol. 2001 8:221), site-by-site dependence, positive selection, etc. as special cases. Moreover, we are focusing on how to incorporate biological covariates (from protein structure and knockout phenotype to expression profile) appropriately, so that the statistical model becomes more biologically relevant.
Integrated Software System
  Our goal is to develop an integrated software system to extract function-related information from DNA/amino acid sequence evolution. We have developed a software system, DIVERGE 1.0 (Bioinformatics 2002. 18:500), for functional divergence analysis and prediction based on site-specific evolutionary rate changes, with a protein 3D viewer that maps the predicted residues onto the 3D structure (more than 300 downloads) (Mol Biol Evol. 1999. 16:1664). We are working on a new version, DIVERGE 2.0, that will include more options: rate variation among sites (Mol Biol Evol. 1997. 14:1106), ancestral sequence inference, functional distance analysis, and other statistical methods for functional divergence (Mol Biol Evol. 2001. 18:453; 18:2327).
Evidence for the Association between Site-Specific Rate Shifts and Changes in Function after Gene Duplication
  As the evolutionary rate of an amino acid residue is inversely correlated with its functional importance, a site-specific shift in evolutionary rate within a gene family (i.e., a site is highly variable in one duplicate gene but very conserved in the other) may indicate a change in functional importance after gene duplication. Using the statistical methods in DIVERGE, we have conducted a large-scale analysis showing that site-specific rate shift is a general evolutionary pattern (J Mol Evol. 2002. 54:725). Moreover, we found evidence that the level of site-specific rate shift among member genes could be related to protein structure differences (Genetics. 2001. 158:1311; Trends in Biochem Sci. 2002. 27:315), the severity of knockout phenotypes, and tissue-specificity (Gu et al. PNAS under review).
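  The idea of a site that is variable in one duplicate cluster but conserved in the other can be illustrated with a crude heuristic. The sketch below is not the statistical method implemented in DIVERGE; it only counts distinct residues per alignment column. The function names, the `min_variable` threshold, and the toy alignments are all hypothetical.

```python
def site_variability(cluster):
    """Number of distinct residues at each alignment column (gaps ignored)."""
    length = len(cluster[0])
    return [len({seq[i] for seq in cluster if seq[i] != "-"}) for i in range(length)]


def candidate_rate_shift_sites(cluster_a, cluster_b, min_variable=3):
    """Columns invariant in one cluster but variable in the other.

    Assumption: both clusters come from the same alignment, so column i
    refers to the same site in each. This is a heuristic screen only.
    """
    var_a = site_variability(cluster_a)
    var_b = site_variability(cluster_b)
    sites = []
    for i, (a, b) in enumerate(zip(var_a, var_b)):
        if (a == 1 and b >= min_variable) or (b == 1 and a >= min_variable):
            sites.append(i)
    return sites


# Toy alignments for two duplicate clusters (hypothetical data).
cluster_a = ["MKLVA", "MKLIA", "MKLTA", "MKLSA"]
cluster_b = ["MRLGA", "MHLGA", "MQLGA", "MELGA"]
print(candidate_rate_shift_sites(cluster_a, cluster_b))
```

  A real analysis would account for the phylogeny within each cluster rather than counting residues, which is the role of the likelihood-based methods cited above.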


Vertebrate genome duplication & origin of human gene family hierarchy
  The 2R model of vertebrate genome duplications, which is hotly debated, postulates two successive polyploidizations prior to the origin of fishes (Genome Res. 2002. 12:1). We address this issue by estimating the age distribution of paralogous genes in the human genome. In total, 1,739 gene duplication events were dated from the phylogenetic analysis of 749 vertebrate gene families, revealing a pattern characterized by two waves (I, II) and an ancient component. Wave I represents a recent gene family expansion by tandem or segmental duplications, whereas Wave II, a rapid increase in paralogous genes in the early stage of vertebrate evolution, supports the 2R model (the big-bang mode). Further analysis indicates that both large- and small-scale gene duplications contributed significantly, during the early stage of vertebrate evolution, to building the current hierarchy of human gene families (Nature Genetics. 2002. 31:205). Our future research plan is to study (1) the impact of gene family proliferation on functional innovations (from tissue-specificity to molecular pathways); (2) the pattern of functional divergence in major human gene families (protein kinases, etc.); and (3) the joint distribution of age and chromosome location of duplicated genes.
Algorithms for ancestral gene order inference and comparative genome mapping
  Genome-level comparative mapping raises new challenges in studies of genome rearrangement, especially for the multiple-species genome rearrangement problem. In this case, a genome is viewed as a signed permutation, where each integer corresponds to a unique gene location and the sign corresponds to its orientation. The Multiple Genome Rearrangement Problem (MGRP) is to find a most parsimonious rearrangement scenario for multiple genomes. Since MGRP is NP-hard, we address this issue by developing efficient heuristic algorithms, e.g., the nearest path search algorithm (Pacific Symposium on Biocomputing (PSB) 2002. 7:259; 2003 in press). We are working on other approaches, including neighbor-perturbing, branch-and-bound, and simulated annealing algorithms.
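  The signed-permutation representation can be made concrete with a short sketch. It shows a single reversal and the standard breakpoint count between two genomes; it is not the nearest path search algorithm cited above. It assumes linear genomes whose genes are labeled 1..n, and the function names are hypothetical.

```python
def reverse_segment(genome, i, j):
    """Reverse genes at positions i..j (inclusive) and flip their orientation."""
    return genome[:i] + [-g for g in reversed(genome[i:j + 1])] + genome[j + 1:]


def adjacencies(genome):
    """Consecutive gene pairs, with the genome framed by 0 and n + 1."""
    framed = [0] + list(genome) + [len(genome) + 1]
    return list(zip(framed, framed[1:]))


def breakpoints(genome, reference):
    """Count adjacencies of `genome` that do not occur in `reference`.

    The adjacency (a, b) is the same as (-b, -a) read from the other strand.
    """
    ref = set()
    for a, b in adjacencies(reference):
        ref.add((a, b))
        ref.add((-b, -a))
    return sum(1 for adj in adjacencies(genome) if adj not in ref)


reference = [1, 2, 3, 4, 5]
genome = reverse_segment(reference, 1, 3)  # reverse the block 2 3 4
print(genome)
print(breakpoints(genome, reference))
```

  Heuristics for rearrangement problems often use counts like this one as a cheap proxy when scoring candidate ancestral gene orders; exact reversal distance requires more involved machinery.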
Whole-genome phylogenetic analysis based on gene (family) content
  For a complete genome, gene (family) content is a string of 1s and 0s representing the presence/absence of gene families. It has been used to infer the minimum set of essential genes, estimate the size of the ancestral genome, reconstruct the genome tree of life, and predict functional interactions between genes, but the results are controversial. The bottleneck is the lack of a rigorous statistical framework for gene content evolution. We are developing probabilistic models aimed at (1) the controversy between the genome tree of life and lateral gene transfer, (2) functional interaction prediction under the statistical framework of a phylogenetic tree, and (3) statistical testing for the existence of a minimum gene set. For instance, we have developed a software system, GenContent, for inferring the genome tree (PNAS 2002, Gu-Zhang submitted).
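  To illustrate the presence/absence encoding, the sketch below compares gene content strings with the Jaccard distance. This is a simple descriptive measure substituted for illustration, not the probabilistic model under development and not the method used in GenContent. The genome names and profiles are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - (families present in both) / (families present in at least one)."""
    shared = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    return 1.0 - shared / either if either else 0.0


# Each position is one gene family: 1 = present, 0 = absent (toy data).
profiles = {
    "genome_A": "1101100",
    "genome_B": "1101000",
    "genome_C": "0010111",
}

names = sorted(profiles)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = jaccard_distance(profiles[first], profiles[second])
        print(first, second, round(d, 3))
```

  A pairwise distance matrix of this kind could be passed to a distance-based tree method, but it ignores gene gain, loss, and lateral transfer, which is why a model-based framework is needed.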


Statistical framework for expression profile evolution
  For a gene family from a single species (e.g., yeast) with a known phylogeny, our goal is to develop a joint density for gene expression when substantial microarray data are available. As with DNA sequences, this will provide a statistical framework to explore the pattern of gene expression evolution (likelihood ratio test), infer the ancestral expression pattern (Bayesian analysis), and infer phylogeny (Gu, PNAS submitted). Moreover, the joint distribution of expression and motifs under a tree is under development.
Evolution of repeat elements, gene regulation and motif detection
  In collaboration with other groups, we have found evidence for the role of repeat elements in spreading regulatory motifs in human (Alu) (Genome Res. 2002 in press) and mouse (B1) (Gene. 2000. 245:319). After conducting a whole-genome association study between yeast sporulation expression and the regulatory motif MSE, we found a positive association between induced expression and MSE for recent duplicate pairs, but not for ancient duplicates (Information Sciences, 2002, in press). It seems that the model of subfunction loss after duplication needs to be revisited, because acquisition of a motif could be repeat-element-mediated. Moreover, nucleotide changes in transcription factors, rather than in binding sites, should also be considered (Mol. Biol. Evol. 2002 19:1490). In bioinformatics, we are studying the power of "phylogenetic footprinting" for motif detection when repeat element activity is overwhelming during the course of genome evolution.
Genetic buffering, duplicate genes and network complexity
  Knocking out a gene in an organism often has little phenotypic effect, owing to two mechanisms: the existence of duplicate genes and genetic buffering by the network (canalization). Their relative importance, however, is controversial (Gu Trends in Genetics in press). Using fitness data for a complete set of single-gene-deletion mutants of the yeast genome, we have, in collaboration with other groups, conducted a genome-wide evaluation of the role of duplicate genes in genetic robustness against null mutations, and found evidence for functional compensation by duplicate genes (Nature, in press). We are developing more rigorous models to explore the evolutionary mechanisms of functional compensation and divergence within a gene family. On the other hand, genetic buffering stems from the complexity of gene networks, termed scale-free, but little is known about its emergence during evolution. Several evolutionary models are under study, considering (1) growth of genome size by domain-gene-genome duplications, (2) allometric growth of gene interactions, and (3) random gene/connection loss. We are also interested in the connection between functional divergence and the origin of gene network modularity, which is not fully expected under scale-free complexity.
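  As a toy illustration of network growth by duplication with random connection loss, the sketch below grows an interaction network one duplicated gene at a time. It is not one of the models under study; the function name, the `keep_prob` parameter, and the two-gene starting network are assumptions made for the example.

```python
import random


def grow_by_duplication(n_genes, keep_prob=0.4, seed=0):
    """Grow an undirected interaction network by repeated gene duplication.

    At each step a random gene is duplicated. The copy inherits each of the
    parent's interactions independently with probability `keep_prob`, which
    stands in for random connection loss after duplication.
    """
    rng = random.Random(seed)
    network = {0: {1}, 1: {0}}  # assumed start: two interacting genes
    while len(network) < n_genes:
        parent = rng.choice(sorted(network))
        child = len(network)
        inherited = [p for p in sorted(network[parent]) if rng.random() < keep_prob]
        network[child] = set()
        for partner in inherited:
            network[child].add(partner)
            network[partner].add(child)
    return network


network = grow_by_duplication(200)
degrees = sorted((len(partners) for partners in network.values()), reverse=True)
print("genes:", len(network))
print("largest degrees:", degrees[:5])
```

  Inspecting the degree distribution of networks grown this way is one simple way to ask whether duplication and loss alone can produce heavy-tailed connectivity.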


Gene expression evolution in humans and chimpanzees
  The regulatory hypothesis claims that humans and chimpanzees differ considerably in mental and linguistic capabilities because of changes in gene regulation. A recent comparative microarray analysis of humans and great apes supports this hypothesis, but statistical problems cast some doubt on the result. Reanalysis of the Affymetrix data (Trends in Genetics (J. Gu-X. Gu) in press) shows that the dramatic brain-expression alterations in humans since the split from chimpanzees are mainly driven by a set of genes with increased (rather than decreased) expression levels in the human brain. Furthermore, we have identified a set of genes with significant expression changes in the human brain (induced or repressed) since the human-chimpanzee split. My wet-lab research wing will sequence the homologous 5'-noncoding regions of these genes in several primates in an attempt to identify human-lineage-specific DNA substitutions. In collaboration with other groups, the 5' surrounding regions of these genes will be compared with the mouse genome to identify conserved noncoding islands. Human population genetics surveys (SNPs) around these gene regions will be used to test the role of positive selection (e.g., by Tajima’s or Fu and Li’s test).
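  For readers unfamiliar with the neutrality tests mentioned above, the following is a minimal sketch of Tajima's D computed from its standard published formula. It assumes aligned sequences of equal length with no gaps or missing data; the function names and the toy sample are hypothetical, and it is not a substitute for established population genetics software.

```python
from itertools import combinations
from math import sqrt


def tajimas_d_from_stats(n, seg_sites, pi):
    """Tajima's D from sample size, segregating sites, and mean pairwise differences."""
    if n < 4 or seg_sites == 0:
        return None  # the variance term is zero or D is undefined
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


def tajimas_d(sequences):
    """Tajima's D for a list of aligned, equal-length sequences."""
    n = len(sequences)
    if n < 4:
        return None
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    pair_diffs = [sum(x != y for x, y in zip(s, t))
                  for s, t in combinations(sequences, 2)]
    pi = sum(pair_diffs) / len(pair_diffs)
    return tajimas_d_from_stats(n, seg_sites, pi)


# Toy sample in which every variant is a singleton (hypothetical data).
sample = ["AAAA", "AAAT", "AATA", "ATAA"]
print(tajimas_d(sample))
```

  An excess of rare variants, as in the toy sample, gives a negative D. In real data such a signal can reflect a selective sweep or population expansion, so the test alone does not establish positive selection.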
Interplay between species evolution and population genetics
  We are interested in how to use sequence evolution information to estimate population genetics parameters. Currently we are focusing on (1) the variation of mutation rate and indels along the genome, as well as the effects of GC content, codon-usage bias, sequence features, etc.; and (2) site-specific selection intensity, i.e., the association between evolutionary conservation and potential disease-related sites. Moreover, we are examining the population genetics basis of molecular evolutionary parameters. For instance, we show that the gamma parameter α for rate variation among sites is inversely proportional to the square root of the effective population size under the stabilizing-selection model.