My research focuses on developing bioinformatics algorithms, especially for sequencing analysis and data integration, to better understand transcriptional and epigenetic regulation in human disease.
Our knowledge of the human genome has evolved dramatically over recent decades. When I did my undergraduate project in the Physics department, studying protein folding in silico, all we knew was the central dogma: “DNA makes RNA makes protein”. In 2001, the first draft of the human genome was published, one year before I enrolled as a PhD student in bioinformatics. The advent of whole-genome technologies gives us the potential to decode every piece of information buried in our DNA. I worked with pioneers at the Beijing Genomics Institute on several international genome projects, such as the Human Genome Project (HGP), the Rice Genome and Pig Genome projects, and later the HapMap project. A DNA sequence by itself cannot tell any story without structural and functional annotation. While annotating genomes, we found that the proportion of a genome that is transcribed into mRNA and then translated into protein may be very small (for human, only ~2%). The mysterious functions of the so-called ‘junk DNA’ or ‘dark matter’ in our genome attracted attention, which ultimately led to the Encyclopedia of DNA Elements (ENCODE) and the Model Organism ENCODE (modENCODE) projects. During the last year of my PhD, my colleagues and I performed the first whole-genome transcriptome analysis of the nematode C. elegans. We discovered hundreds of RNA transcripts that map to ‘junk DNA’ where no protein-coding genes had been identified. However, the genome contains not only the information needed to transcribe functional RNAs but also ‘regulatory elements’, the brakes and accelerators that control gene expression. Identifying and deciphering the functions of regulatory elements became my major direction after I joined X. Shirley Liu’s lab at Harvard as a postdoctoral fellow.
Among all types of regulatory elements, I am particularly interested in the cis-regulatory DNA elements that are bound by trans-acting factors (a.k.a. the cistrome, a name coined by my colleagues at Dana-Farber) or marked by epigenetic modifications such as DNA methylation or histone modifications.
Transcription factors often bind to DNA and interact with the transcription machinery to enhance or repress gene expression. Epigenetic features such as histone modification, chromatin remodeling factor binding, DNA methylation, and 3D chromatin organization add yet another layer of information, making the regulatory dynamics within the nucleus more complex to understand. With advances in sequencing technology, however, such information can now be measured and quantified at genome scale, although the growing number of large genomic datasets creates challenges as well as opportunities for bioinformatics methodology.
The MACS algorithm (Genome Biology 2008, cited over 840 times according to Google Scholar), which I helped develop in my postdoctoral lab, is one of the most widely used algorithms for predicting cis-regulatory elements from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq). The algorithm has evolved over the years to accommodate various factor types, from punctate transcription factor binding to long-range histone modifications. It has been used to process hundreds of publicly available datasets in the mod/ENCODE projects, and it continues to be a focus of my lab. To elucidate the function of a particular transcription factor and link that factor to human disease, it is important to compare the cistrome of the same factor under two or more conditions, such as a knock-down assay against wild type, or a healthy sample against a disease sample. However, there are few published algorithms designed for this purpose. Existing solutions mainly depend on predefined genomic regions of interest, in which the cistrome is described as a binary profile of present and absent calls. This approach relies on cutoff values for the initial calls and cannot provide detailed resolution. One of my ongoing projects is to generate quantitative cistrome profiles by assigning to each genomic location a value representing the strength of the factor-DNA interaction. In this way, differences in the cistrome between two conditions can be described more flexibly, not only as bound in one condition and absent in the other, but also as changes in binding affinity or local shifts of protein-DNA interaction sites.
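The difference between binary calls and a quantitative profile can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the signal values, the cutoff, the pseudocount, and the function names are hypothetical, and the log2 ratio is only one simple stand-in for a quantitative comparison. It is not the MACS algorithm, nor the method under development described above.

```python
import math

def binary_calls(signal, cutoff):
    """Present/absent call per region: 1 if signal reaches the cutoff, else 0."""
    return [1 if s >= cutoff else 0 for s in signal]

def log2_fold_change(cond_a, cond_b, pseudocount=1.0):
    """Per-region log2 ratio of condition B over condition A.

    The pseudocount (an assumed value) avoids division by zero in
    regions with no signal.
    """
    return [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(cond_a, cond_b)
    ]

# Hypothetical normalized binding signal at four genomic regions.
wildtype = [50.0, 12.0, 3.0, 40.0]
knockdown = [20.0, 11.0, 3.0, 2.0]
CUTOFF = 10.0  # hypothetical threshold for calling a region "bound"

# Binary comparison: a region differs only if its call flips.
calls_wt = binary_calls(wildtype, CUTOFF)
calls_kd = binary_calls(knockdown, CUTOFF)
binary_diff = [a != b for a, b in zip(calls_wt, calls_kd)]

# Quantitative comparison: every region gets a signed change in strength.
lfc = log2_fold_change(wildtype, knockdown)

for i, (flipped, change) in enumerate(zip(binary_diff, lfc)):
    print(f"region {i}: call flipped={flipped}, log2 fold change={change:.2f}")
```

In this toy data the first region stays "bound" in both conditions, so the binary comparison reports no difference, while the quantitative comparison shows that its signal drops by more than half.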
Nowadays, ChIP experiments are widely used to investigate genome-wide transcription factor regulation and histone modifications. However, although experimental labs have protocols to follow for generating clean and robust data from microarrays or sequencers, there are no comparable protocols for downstream data analysis. Researchers need both hardware resources and computational expertise to use the different algorithms effectively. To lower the barrier between experimentalists and bioinformaticians, and to integrate data analyses across different platforms (such as RNA-seq and ChIP-seq), we built a user-friendly web service based on the open Galaxy framework, called the Cistrome Analysis Pipeline or CistromeAP (Genome Biology 2011). To date, over 4,000 registered users worldwide run analyses on the CistromeAP server. The Cistrome platform will continue as a collaborative project between my UB lab and research partners at Harvard University. The future direction is to extend the functionality of the CistromeAP platform. We currently plan to add tools for nucleosome positioning analysis, histone dynamics, DNA methylation and gene expression, and to compile more workflows so that researchers can easily reproduce their analyses.
I have also been involved in many collaborative research projects, such as a study of circadian binding of histone deacetylase and the nuclear receptor Rev-erbα in mouse liver (Science 2011), the modENCODE consortium project to elucidate chromatin factor functions in C. elegans (Genome Research 2011 and Science 2010), and the ENCODE Data Analysis Center's work on developing and evaluating algorithms for ChIP-Seq and DNase-Seq data (Genome Research 2013). Working in consortia gives me the opportunity to study transcriptional and epigenetic regulation systematically. We combined over 40 different histone modification profiles for C. elegans and identified broad chromosomal domains with similar histone modifications, corresponding to active and silenced chromatin regions. Surprisingly, those regions are largely conserved between different developmental stages of the worm and correlate significantly with the spatial organization of chromatin in the nucleus. When we compared the epigenetic profiles of two model organisms, C. elegans and D. melanogaster (fruit fly), with those of human (manuscript submitted to Nature), we found that all three species speak a similar ‘epigenetic’ language. For example, H3 lysine 4 trimethylation (H3K4me3) is found at the active promoter regions of genes, H3K27me3 at silenced genes, H3K36me3 at transcribed gene bodies, H3K9me3 at heterochromatin regions, and so on.