Leache A.D.,University of Washington | Harris R.B.,University of Washington | Rannala B.,University of California at Davis | Rannala B.,CAS Beijing Institute of Genomics | And 2 more authors.
Systematic Biology | Year: 2014

Gene flow among populations or species and incomplete lineage sorting (ILS) are two evolutionary processes responsible for generating gene tree discordance and therefore hindering species tree estimation. Numerous studies have evaluated the impacts of ILS on species tree inference, yet the ramifications of gene flow on species trees remain less studied. Here, we simulate and analyse multilocus sequence data generated with ILS and gene flow to quantify their impacts on species tree inference. We characterize species tree estimation errors under various models of gene flow, such as the isolation-migration model, the n-island model, and gene flow between non-sister species or involving ancestral species, and species boundaries crossed by a single gene copy (allelic introgression) or by a single migrant individual. These patterns of gene flow are explored on species trees of different sizes (4 vs. 10 species), at different time scales (shallow vs. deep), and with different migration rates. Species trees are estimated with the multispecies coalescent model using Bayesian methods (BEST and *BEAST) and with a summary statistic approach (MPEST) that facilitates phylogenomic-scale analysis. Even in cases where the topology of the species tree is estimated with high accuracy, we find that gene flow can result in overestimates of population sizes (species tree dilation) and underestimates of species divergence times (species tree compression). Signatures of migration events remain present in the distribution of coalescent times for gene trees, and with sufficient data it is possible to identify those loci that have crossed species boundaries. These results highlight the need for careful sampling design in phylogeographic and species delimitation studies as gene flow, introgression, or incorrect sample assignments can bias the estimation of the species tree topology and of parameter estimates such as population sizes and divergence times. © 2013 The Author(s).

Agency: European Commission | Branch: FP7 | Program: CP-IP | Phase: HEALTH-2007-2.1.1-4 | Award Amount: 21.36M | Year: 2008

A detailed understanding of human biology will require not only knowledge of the human genome but also of the human metagenome, defined here as the ensemble of the genomes of human-associated microorganisms. Our proposal focuses on the microorganisms of the gut, which are particularly abundant and complex and have an important role for human health and well-being. We shall implement and integrate the following activities: (i) creation of a reference set of genes and genomes of intestinal microbes, using high fidelity metagenomic sequencing and full genome sequencing of selected bacterial species; (ii) creation of generic tools, based on high density DNA arrays and novel ultra-high throughput re-sequencing techniques, to study the variation of human gut microbiota; (iii) use of the tools to search for correlations between the genes present in the gut microbiota and disease, focusing on inflammatory bowel disease and obesity, two pathologies of increasing social relevance in Europe; (iv) study of the genes correlated with disease, both in terms of their function in microbes and their effect on the host, with a focus on host-microbe interactions; (v) development of an informatics resource to store and organize the heterogeneous information generated within the project, such as gene and genome sequences, gene frequencies in healthy and sick individuals, and gene functions, enriched by information relevant to human gut microbiota from outside the project; (vi) creation of bioinformatics tools to carry out the meta-analysis of the information; (vii) creation of an interface with the stakeholders, including an international board to promote cooperation and coordination in the human metagenome field, and with the general public. Our project will place Europe in a leading position in this field and open avenues to modulate human gut microbiota in a reasoned way, making it possible to optimize the health and well-being of every individual.

Agency: European Commission | Branch: FP7 | Program: CSA-CA | Phase: HEALTH.2010.2.1.1-2 | Award Amount: 2.29M | Year: 2011

A detailed understanding of human biology will require characterisation of the human-associated microorganisms, the human microbiome, and of the roles these microbes play in health and disease. Large projects in Europe, the United States, China and Canada target these objectives, using high throughput omics approaches. Given the complexity of our microbial communities, composed of thousands of species and differing considerably between individuals, as well as the multitude of effects they have on our biology, none of the projects can hope to achieve their comprehensive characterisation alone. To progress most efficiently towards this ambitious goal, it is of utmost importance that the data generated in each individual project be optimally comparable across all the current projects and those yet to come. Our proposal seeks to coordinate development of standard operating procedures and protocols, which will optimize data comparisons in the human microbiome field and thus improve the synergy between all the projects. It focuses on three key aspects of data generation: (i) human sample collection, processing and identification via the associated metadata; (ii) DNA sequence quality obtained by the new generation methods from complex microbial mixtures; (iii) analysis of DNA sequence in conjunction with the metadata. Importantly, it organises public access to the standard operating procedures and protocols and enables exchanges between the users and providers of the standards. It gathers a very strong international partnership that includes leaders in the field and represents the current large projects, which span three continents: Europe, Asia and America. Furthermore, it interfaces via the International Human Microbiome Consortium with additional projects from Africa and Australia. The proposal is thus highly congruent with the focus of the call, which targets omics, standards and international context.

Dos Reis M.,University College London | Zhu T.,CAS Beijing Institute of Genomics | Yang Z.,University College London
Systematic Biology | Year: 2014

Bayesian methods provide a powerful way to estimate species divergence times by combining information from molecular sequences with information from the fossil record. With the explosive increase of genomic data, divergence time estimation increasingly uses data of multiple loci (genes or site partitions). Widely used computer programs to estimate divergence times use independent and identically distributed (i.i.d.) priors on the substitution rates for different loci. The i.i.d. prior is problematic. As the number of loci (L) increases, the prior variance of the average rate across all loci goes to zero at the rate 1/L. As a consequence, the rate prior dominates posterior time estimates when many loci are analyzed, and if the rate prior is misspecified, the estimated divergence times will converge to wrong values with very narrow credibility intervals. Here we develop a new prior on the locus rates based on the Dirichlet distribution that corrects the problematic behavior of the i.i.d. prior. We use computer simulation and real data analysis to highlight the differences between the old and new priors. For a dataset for six primate species, we show that with the old i.i.d. prior, if the prior rate is too high (or too low), the estimated divergence times are too young (or too old), outside the bounds imposed by the fossil calibrations. In contrast, with the new Dirichlet prior, posterior time estimates are insensitive to the rate prior and are compatible with the fossil calibrations. We re-analyzed a phylogenomic data set of 36 mammal species and show that using many fossil calibrations can alleviate the adverse impact of a misspecified rate prior to some extent. We recommend the use of the new Dirichlet prior in Bayesian divergence time estimation. [Bayesian inference, divergence time, relaxed clock, rate prior, partition analysis.] © 2014 The Author(s).
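The 1/L behaviour described in this abstract follows from a general fact: the mean of L independent draws with variance s^2 has variance s^2/L. The sketch below illustrates this with gamma-distributed locus rates. The gamma family, the shape and scale values, and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import random
import statistics


def variance_of_mean_rate(shape: float, scale: float, n_loci: int) -> float:
    """Analytic prior variance of the mean of n_loci i.i.d. gamma rates.

    A single gamma(shape, scale) rate has variance shape * scale**2,
    so the mean of n_loci independent rates has that variance divided by n_loci.
    """
    return shape * scale ** 2 / n_loci


def simulated_variance_of_mean_rate(shape, scale, n_loci, n_draws=5000, seed=1):
    """Monte Carlo estimate of the same quantity (hypothetical helper)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gammavariate(shape, scale) for _ in range(n_loci))
        for _ in range(n_draws)
    ]
    return statistics.variance(means)


if __name__ == "__main__":
    # Illustrative prior: shape 2, scale 0.5 (mean rate 1.0, variance 0.5).
    for n_loci in (1, 10, 100):
        print(n_loci, variance_of_mean_rate(2.0, 0.5, n_loci))
    print("simulated, 10 loci:", simulated_variance_of_mean_rate(2.0, 0.5, 10))
```

With many loci the prior on the average rate becomes nearly a point mass, which is why a misspecified prior mean can dominate the posterior time estimates.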

Xie B.,CAS Beijing Institute of Genomics
PloS one | Year: 2013

Divergently paired genes (DPGs), also known as bidirectional (head-to-head positioned) genes, are conserved across species and lineages, and thus deemed to be exceptional in genomic organization and functional regulation. Despite previous investigations of the features of their conservation and gene organization, the functional relationship among DPGs in a given species and lineage has not been thoroughly clarified. Here we report a comprehensive network-based analysis of human DPGs, and our results indicate that the two members of the DPGs tend to participate in different biological processes while enforcing related functions as modules. Compared with randomly paired genes as a control, the DPG pairs have a tendency to be clustered in similar "cellular components" and involved in similar "molecular functions". The functional network bridged by DPGs consists of three major modules. The largest module includes many housekeeping genes involved in core cellular activities. This module also shows low variation in expression in both CNS (central nervous system) and non-CNS tissues. Based on analyses of disease transcriptome data, we further suggest that this particular module may play crucial roles in HIV infection and its disease mechanism.

Zhang D.,CAS Beijing Institute of Genomics
BMC microbiology | Year: 2012

Hepatitis B virus (HBV), because of its error-prone viral polymerase, has a high mutation rate leading to widespread substitutions, deletions, and insertions in the HBV genome. Deletions may significantly change viral biological features complicating the progression of liver diseases. However, the clinical conditions correlating to the accumulation of deleted mutants remain unclear. In this study, we explored HBV deletion patterns and their association with disease status and antiviral treatment by performing whole genome sequencing on samples from 51 hepatitis B patients and by monitoring changes in deletion variants during treatment. Clone sequencing was used to analyze preS regions in another cohort of 52 patients. Among the core, preS, and basic core promoter (BCP) deletion hotspots, we identified preS to have the highest frequency and the most complex deletion pattern using whole genome sequencing. Further clone sequencing analysis on preS identified 70 deletions which were classified into 4 types, the most common being preS2. Also, in contrast to the core and BCP regions, most preS deletions were in-frame. Most deletions interrupted viral surface epitopes, and are possibly involved in evading immuno-surveillance. Among various clinical factors examined, logistic regression showed that antiviral medication affected the accumulation of deletion mutants (OR = 6.81, 95% CI = 1.296 ~ 35.817, P = 0.023). In chronic carriers of the virus, and individuals with chronic hepatitis, the deletion rate was significantly higher in the antiviral treatment group (Fisher exact test, P = 0.007). Particularly, preS2 deletions were associated with the usage of nucleos(t)ide analog therapy (Fisher exact test, P = 0.023). Dynamic increases in preS1 or preS2 deletions were also observed in quasispecies from samples taken from patients before and after three months of ADV therapy. 
In vitro experiments demonstrated that preS2 deletions alone were not responsible for antiviral resistance, implying the coordination between wild type and mutant strains during viral survival and disease development. We present the HBV deletion distribution patterns and preS deletion substructures in viral genomes that are prevalent in northern China. The accumulation of preS deletion mutants during nucleos(t)ide analog therapy may be due to viral escape from host immuno-surveillance.
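The logistic regression result quoted in this abstract (OR = 6.81, 95% CI 1.296 to 35.817, P = 0.023) can be checked for internal consistency. The sketch below assumes a Wald-type interval, symmetric on the log-odds scale; the abstract does not state how the interval was computed, so this is a reader's check and not the authors' procedure. The function name is hypothetical.

```python
import math


def wald_check(odds_ratio: float, ci_low: float, ci_high: float,
               z_crit: float = 1.959964) -> dict:
    """Recover the standard error, z statistic and two-sided p-value
    implied by an odds ratio and its 95% Wald confidence interval."""
    log_or = math.log(odds_ratio)
    # For a Wald interval, the log-scale width equals 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_or / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    # The point estimate should sit at the geometric mean of the bounds.
    midpoint = math.exp((math.log(ci_low) + math.log(ci_high)) / 2)
    return {"se": se, "z": z, "p": p, "midpoint": midpoint}


if __name__ == "__main__":
    print(wald_check(6.81, 1.296, 35.817))
```

Under that assumption the geometric midpoint of the interval comes out close to the reported odds ratio and the implied p-value close to the reported one, so the three figures agree with each other.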

Dai L.,CAS Beijing Institute of Genomics
Biology direct | Year: 2012

As advances in life sciences and information technology bring profound influences on bioinformatics due to its interdisciplinary nature, bioinformatics is experiencing a new leap-forward from in-house computing infrastructure into utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. Albeit relatively new, cloud computing promises to address big data storage and analysis issues in the bioinformatics field. Here we review extant cloud-based services in bioinformatics, classify them into Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), and present our perspectives on the adoption of cloud computing in bioinformatics. REVIEWERS: This article was reviewed by Frank Eisenhaber, Igor Zhulin, and Sandor Pongor.

Ma L.,CAS Beijing Institute of Genomics | Bajic V.B.,King Abdullah University of Science and Technology | Zhang Z.,CAS Beijing Institute of Genomics
RNA Biology | Year: 2013

Long non-coding RNAs (lncRNAs) have been found to perform various functions in a wide variety of important biological processes. To ease interpretation of lncRNA functionality and to enable deep mining of these transcribed sequences, it is convenient to classify lncRNAs into different groups. Here, we summarize classification methods of lncRNAs according to their four major features, namely, genomic location and context, effect exerted on DNA sequences, mechanism of functioning and their targeting mechanism. In combination with the presently available function annotations, we explore potential relationships between different classification categories, and generalize and compare biological features of different lncRNAs within each category. Finally, we present our view on potential further studies. We believe that the classifications of lncRNAs as indicated above are of fundamental importance for lncRNA studies, helpful for further investigation of specific lncRNAs, for formulation of new hypotheses based on different features of lncRNA and for exploration of the underlying lncRNA functional mechanisms. © 2013 Landes Bioscience.

Liu S.,CAS Beijing Institute of Genomics
Journal of Proteome Research | Year: 2010

A proteomic strategy combining 2DE, Western blot, and mass spectrometry was implemented to survey the status of tyrosine nitration in mouse heart mitochondria. Compared to normal mice, nitrated proteins in the heart mitochondria of the db/db mouse model were significantly augmented due to diabetic development. A total of 18 proteins were identified as the nitration targets. Of the nitrated proteins, succinyl-CoA:3-oxoacid CoA-transferase (SCOT) is a key enzyme involved in ketolysis, and how its catalysis is affected by nitration has yet to be explored. We therefore initiated a systematic investigation toward the nitrated site(s) and the corresponding changes of SCOT catalysis. To monitor modification kinetics and nitrated residue(s), recombinant SCOT was incubated with peroxynitrite followed by examination of nitration development as well as catalytic activity changes. The nitration of recombinant SCOT steadily increased in response to increasing concentrations of peroxynitrite, while its catalysis was gradually attenuated. The nitrated sites of modified SCOT were further identified by LC-ESI-MS/MS. The MS/MS spectra indicated a +45 mass unit ion shift from [M + H]+ m/z at Tyr4 and Tyr76. Through site-directed mutagenesis, we found that mutation of tyrosine residues at Tyr4 or Tyr76 not only significantly protected SCOT from peroxynitrite modification but also dramatically prevented loss of enzymatic activity. Taken together, these results indicate that the two tyrosine residues of SCOT are the priority sites attacked by NO, and their nitration status is a causal factor leading to inhibition of SCOT catalysis. © 2010 American Chemical Society.
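The +45 mass unit shift reported for the nitrated tyrosines matches the arithmetic of replacing one ring hydrogen with a nitro group (NO2). A minimal sketch of that calculation follows, using standard monoisotopic atomic masses rounded to six decimals; the function names are illustrative.

```python
# Monoisotopic atomic masses in daltons (standard values, rounded).
MASS_H = 1.007825
MASS_N = 14.003074
MASS_O = 15.994915


def nitration_mass_shift() -> float:
    """Mass added when a hydrogen is replaced by a nitro group (NO2)."""
    return MASS_N + 2 * MASS_O - MASS_H


def mz_shift(charge: int) -> float:
    """Shift observed on the m/z axis for an ion carrying `charge` charges."""
    return nitration_mass_shift() / charge


if __name__ == "__main__":
    print(f"mass shift: {nitration_mass_shift():.3f} Da")
    print(f"m/z shift at 1+: {mz_shift(1):.3f}")
    print(f"m/z shift at 2+: {mz_shift(2):.3f}")
```

For a singly charged [M + H]+ ion the shift on the m/z axis equals the mass shift, about 44.985, which rounds to the nominal +45 quoted in the abstract.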

CAS Beijing Institute of Genomics | Date: 2014-12-05

The present invention provides a sequencing library whose inserted fragment is an equidirectional alternating concatemer of a sequence to be tested and a tag sequence. The present invention further provides a method for preparing the sequencing library. The present invention also provides a sequencing method. The sequencing library and sequencing method provided in the present invention are capable of removing DNA amplification errors and sequencing errors at any sequencing depth, so that mutations of DNA molecules can be determined ultra-accurately. The sequencing library of the present invention is suitable for constructing a sequencing library from trace short DNA fragments and even from single-stranded DNA.
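The abstract does not spell out how the concatemer structure removes errors. One plausible reading, offered here only as an assumption, is that each read carries several copies of the same original molecule separated by the tag, so a per-position majority vote across the copies cancels independent errors. The sketch below illustrates that idea; the function names, the tag GGGG and the toy read are all hypothetical and are not taken from the patent.

```python
from collections import Counter


def split_concatemer(read: str, tag: str) -> list[str]:
    """Split a read on the tag and return the non-empty copies between tags."""
    return [unit for unit in read.split(tag) if unit]


def consensus(units: list[str]) -> str:
    """Majority vote per position across copies of the most common length."""
    if not units:
        raise ValueError("no copies to build a consensus from")
    # Discard copies whose length differs (for example, truncated end copies).
    length = Counter(len(u) for u in units).most_common(1)[0][0]
    same_length = [u for u in units if len(u) == length]
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*same_length)
    )


if __name__ == "__main__":
    tag = "GGGG"  # hypothetical tag, chosen so it does not occur in the target
    # Three copies of a toy target; the middle copy carries one error (T -> A).
    read = "ACGTAC" + tag + "ACGAAC" + tag + "ACGTAC"
    print(consensus(split_concatemer(read, tag)))
```

A real pipeline would need alignment-based splitting to tolerate errors inside the tag and indels inside the copies; exact string splitting is used here only to keep the sketch short.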

