Simons Center for Quantitative Biology

Cold Spring Harbor, NY, United States


Huang Y.-F.,Simons Center for Quantitative Biology | Gulko B.,Simons Center for Quantitative Biology | Gulko B.,Cornell University | Siepel A.,Simons Center for Quantitative Biology
Nature Genetics | Year: 2017

Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
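
The generalized-linear-model component can be pictured as a logistic function of site-level genomic features. Below is a minimal sketch of that idea; the feature values, weights, and bias are hypothetical, and the actual LINSIGHT model fits its parameters jointly with the evolutionary model rather than taking them as given.

```python
import numpy as np

def fitness_consequence_prob(features, weights, bias):
    """Logistic (generalized linear) map from a site's functional-genomic
    features to a probability that mutations there are deleterious.
    This mirrors only the GLM half of LINSIGHT; the real model fits its
    parameters jointly with a probabilistic model of molecular evolution."""
    z = np.dot(features, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical features for one noncoding site, e.g. a conservation score,
# DNase hypersensitivity signal, and a TF-binding annotation; the weights
# are made-up illustrative values, not fitted LINSIGHT parameters.
features = np.array([0.8, 1.2, 1.0])
weights = np.array([1.5, 0.7, 0.9])
print(fitness_consequence_prob(features, weights, bias=-2.0))  # ~0.72
```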


Core L.J.,Cornell University | Core L.J.,University of Connecticut | Martins A.L.,Cornell University | Danko C.G.,Cornell University | And 5 more authors.
Nature Genetics | Year: 2014

Despite the conventional distinction between them, promoters and enhancers share many features in mammals, including divergent transcription and similar modes of transcription factor binding. Here we examine the architecture of transcription initiation through comprehensive mapping of transcription start sites (TSSs) in human lymphoblastoid B cell (GM12878) and chronic myelogenous leukemic (K562) ENCODE Tier 1 cell lines. Using a nuclear run-on protocol called GRO-cap, which captures TSSs for both stable and unstable transcripts, we conduct detailed comparisons of thousands of promoters and enhancers in human cells. These analyses identify a common architecture of initiation, including tightly spaced (110 bp apart) divergent initiation, similar frequencies of core promoter sequence elements, highly positioned flanking nucleosomes and two modes of transcription factor binding. Post-initiation transcript stability provides a more fundamental distinction between promoters and enhancers than patterns of histone modification and association of transcription factors or co-activators. These results support a unified model of transcription initiation at promoters and enhancers. © 2014 Nature America, Inc. All rights reserved.
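
As a toy illustration of one architectural signature reported here, the sketch below pairs plus- and minus-strand TSS positions that fall within ~110 bp of one another. The coordinates are hypothetical, and the real analysis works from genome-wide GRO-cap signal rather than pre-called positions.

```python
import bisect

def divergent_pairs(plus_tss, minus_tss, max_gap=110):
    """Return (minus, plus) TSS pairs in which a minus-strand TSS lies
    within max_gap bp upstream of a plus-strand TSS -- a naive proxy for
    the tightly spaced divergent initiation described above."""
    minus_tss = sorted(minus_tss)
    pairs = []
    for p in sorted(plus_tss):
        i = bisect.bisect_left(minus_tss, p - max_gap)
        while i < len(minus_tss) and minus_tss[i] <= p:
            pairs.append((minus_tss[i], p))
            i += 1
    return pairs

# Hypothetical coordinates on one chromosome
print(divergent_pairs(plus_tss=[1000, 5000], minus_tss=[920, 4890, 7000]))
# [(920, 1000), (4890, 5000)]
```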


Siepel A.,Cornell University | Siepel A.,Simons Center for Quantitative Biology | Arbiza L.,Cornell University
Current Opinion in Genetics and Development | Year: 2014

Modification of gene regulation has long been considered an important force in human evolution, particularly through changes to cis-regulatory elements (CREs) that function in transcriptional regulation. For decades, however, the study of cis-regulatory evolution was severely limited by the available data. New data sets describing the locations of CREs and genetic variation within and between species have now made it possible to study CRE evolution much more directly on a genome-wide scale. Here, we review recent research on the evolution of CREs in humans based on large-scale genomic data sets. We consider inferences based on primate divergence, human polymorphism, and combinations of divergence and polymorphism. We then consider 'new frontiers' in this field stemming from recent research on transcriptional regulation. © 2014 Elsevier Ltd.
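
One concrete way divergence and polymorphism are combined is the McDonald-Kreitman logic, sketched below with hypothetical counts; the genome-wide methods reviewed here are considerably more elaborate, but they share this basic contrast between a selected and a neutral class of sites.

```python
def mk_alpha(d_sel, d_neut, p_sel, p_neut):
    """Estimate alpha, the fraction of between-species substitutions in a
    selected class of sites (e.g., CREs) attributable to positive selection,
    from divergence (d) and polymorphism (p) counts in selected vs. neutral
    site classes: alpha = 1 - (d_neut * p_sel) / (d_sel * p_neut)."""
    return 1.0 - (d_neut * p_sel) / (d_sel * p_neut)

# Hypothetical counts: 400 substitutions / 300 SNPs at CRE sites,
# 500 substitutions / 500 SNPs at putatively neutral sites.
print(mk_alpha(d_sel=400, d_neut=500, p_sel=300, p_neut=500))  # 0.25
```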


Mohammed J.,Cornell University | Mohammed J.,Tri Institutional Training Program in Computational Biology and Medicine | Siepel A.,Cornell University | Siepel A.,Simons Center for Quantitative Biology | Lai E.C.,Sloan Kettering Institute
RNA | Year: 2014

Many animal miRNA loci reside in genomic clusters that generate multicistronic primary-miRNA transcripts. While clusters that contain copies of the same miRNA hairpin are clearly products of local duplications, the evolutionary provenance of clusters with disparate members is less clear. Recently, it was proposed that essentially all such clusters in Drosophila derived from de novo formation of miRNA-like hairpins within existing miRNA transcripts, and that the maintenance of multiple miRNAs in such clusters was due to evolutionary hitchhiking on a major cluster member. However, this model seems at odds with the fact that many such miRNA clusters are composed of well-conserved miRNAs. In an effort to trace the birth and expansion of miRNA clusters that are presently well-conserved across Drosophilids, we analyzed a broad swath of metazoan species, with particular emphasis on arthropod evolution. Beyond duplication and de novo birth, we highlight a diversity of modes that contribute to miRNA evolution, including neofunctionalization of miRNA copies, fissioning of locally duplicated miRNA clusters, miRNA deletion, and miRNA cluster expansion via the acquisition and/or neofunctionalization of miRNA copies from elsewhere in the genome. In particular, we suggest that miRNA clustering by acquisition represents an expedient strategy to bring cohorts of target genes under coordinate control by miRNAs that had already been individually selected for regulatory impact on the transcriptome. © 2014 Mohammed et al.
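
As a minimal illustration of what "clustered" means operationally, the sketch below groups miRNA hairpins on a chromosome whenever neighbors fall within a distance cutoff. The 10 kb threshold and the coordinates are illustrative assumptions, not the paper's definitions.

```python
def cluster_mirnas(loci, max_gap=10_000):
    """Group miRNA loci on one chromosome into clusters whenever adjacent
    hairpins fall within max_gap bp of each other, approximating membership
    in a shared primary transcript. Loci are (name, start) tuples; the
    10 kb cutoff is a common heuristic, not this paper's criterion."""
    loci = sorted(loci, key=lambda x: x[1])
    clusters, current = [], [loci[0]]
    for locus in loci[1:]:
        if locus[1] - current[-1][1] <= max_gap:
            current.append(locus)
        else:
            clusters.append(current)
            current = [locus]
    clusters.append(current)
    return clusters

# Hypothetical coordinates: two nearby hairpins plus one distant singleton
loci = [("mir-a", 100_000), ("mir-b", 103_500), ("mir-c", 500_000)]
print(cluster_mirnas(loci))
```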


Kelley D.R.,University of Maryland, College Park | Schatz M.C.,Simons Center for Quantitative Biology | Salzberg S.L.,University of Maryland, College Park
Genome Biology | Year: 2010

We introduce Quake, a program to detect and correct errors in DNA sequencing reads. Using a maximum likelihood approach incorporating quality values and nucleotide specific miscall rates, Quake achieves the highest accuracy on realistically simulated reads. We further demonstrate substantial improvements in de novo assembly and SNP detection after using Quake. Quake can be used for any size project, including more than one billion human reads, and is freely available as open source software from http://www.cbcb.umd.edu/software/quake. © 2010 Kelley et al.; licensee BioMed Central Ltd.
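
The flavor of quality-aware likelihood scoring can be sketched as follows. This simplified model splits each base's Phred-derived error probability evenly across the three alternative nucleotides; Quake itself additionally learns nucleotide-specific miscall rates from the data and uses k-mer coverage to locate errors.

```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to the probability the base is wrong."""
    return 10 ** (-q / 10.0)

def correction_loglik(read, quals, candidate):
    """Log-likelihood that `candidate` is the true sequence underlying
    `read`, treating bases independently and splitting each error
    probability evenly among the three alternative nucleotides. This is
    a simplified stand-in for Quake's model, not its implementation."""
    ll = 0.0
    for obs, q, true in zip(read, quals, candidate):
        p_err = phred_to_prob(q)
        ll += math.log(1.0 - p_err) if obs == true else math.log(p_err / 3.0)
    return ll

read, quals = "ACGT", [30, 10, 30, 30]
print(correction_loglik(read, quals, "ACGT"))  # keep read as sequenced
print(correction_loglik(read, quals, "AGGT"))  # correct the low-quality C
```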


Wences A.H.,Simons Center for Quantitative Biology | Wences A.H.,National Autonomous University of Mexico | Schatz M.C.,Simons Center for Quantitative Biology
Genome Biology | Year: 2015

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for meta-assembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net. © 2015 Wences and Schatz.
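
One signal commonly used to arbitrate between assemblies at a conflicting locus is the compression-expansion (CE) statistic computed from spanning mate pairs. The sketch below computes it with hypothetical insert sizes; it is one ingredient of this kind of merging, not Metassembler's full decision procedure.

```python
import math

def ce_statistic(observed_inserts, expected_mean, expected_sd):
    """Compression-expansion (CE) statistic: a Z-score comparing the mean
    mate-pair insert size spanning a locus to the library's expected mean.
    Strongly negative values suggest a compressed (collapsed) region;
    strongly positive values suggest an expansion."""
    n = len(observed_inserts)
    mean_obs = sum(observed_inserts) / n
    return (mean_obs - expected_mean) / (expected_sd / math.sqrt(n))

# Hypothetical library with 3000 +/- 300 bp inserts; the spanning pairs
# here look compressed relative to expectation.
print(ce_statistic([2500, 2600, 2550, 2480], 3000, 300))  # ~ -3.1
```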


Schatz M.C.,Simons Center for Quantitative Biology
Genome Research | Year: 2015

The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than 1 billion genomes, bringing even deeper insight into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential for medicine and biology, though, will only be achieved through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. In this perspective, I aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world. © 2015 Schatz.
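
A back-of-envelope estimate conveys the scale of the data such a future implies. The coverage, genome size, and compression figures below are illustrative assumptions, not values from the paper.

```python
# Rough scale of "1 billion genomes": assume 30x coverage of a 3 Gb genome
# and ~1 byte per sequenced base after compression (illustrative numbers).
genomes = 1_000_000_000
bytes_per_genome = 30 * 3e9 * 1.0          # ~90 GB of reads per genome
total_exabytes = genomes * bytes_per_genome / 1e18
print(f"~{total_exabytes:.0f} EB of raw reads")  # ~90 exabytes
```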


Kinney J.B.,Simons Center for Quantitative Biology
Physical Review E - Statistical, Nonlinear, and Soft Matter Physics | Year: 2015

The need to estimate smooth probability distributions (a.k.a. probability densities) from finite sampled data is ubiquitous in science. Many approaches to this problem have been described, but none is yet regarded as providing a definitive solution. Maximum entropy estimation and Bayesian field theory are two such approaches. Both have origins in statistical physics, but the relationship between them has remained unclear. Here I unify these two methods by showing that every maximum entropy density estimate can be recovered in the infinite smoothness limit of an appropriate Bayesian field theory. I also show that Bayesian field theory estimation can be performed without imposing any boundary conditions on candidate densities, and that the infinite smoothness limit of these theories recovers the most common types of maximum entropy estimates. Bayesian field theory thus provides a natural test of the maximum entropy null hypothesis and, furthermore, returns an alternative (lower entropy) density estimate when the maximum entropy hypothesis is falsified. The computations necessary for this approach can be performed rapidly for one-dimensional data, and software for doing this is provided. © 2015 authors. Published by the American Physical Society under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.
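
The maximum entropy half of this picture can be demonstrated generically: constraining the first two moments and maximizing entropy yields a Gaussian. The sketch below solves the convex dual problem on a grid; it is a generic illustration of maximum entropy density estimation, not the software that accompanies the paper.

```python
import numpy as np
from scipy.optimize import minimize

def maxent_density(x, constraint_fns, targets):
    """Maximum entropy density on grid x subject to E[f_k] = targets[k],
    where constraint_fns[k] is f_k evaluated on x. The max-ent solution
    has the form p(x) ~ exp(-sum_k lambda_k f_k(x)); the multipliers are
    found by minimizing the convex dual, log Z(lambda) + lambda . mu."""
    F = np.stack(constraint_fns)            # shape (K, len(x))
    dx = x[1] - x[0]

    def dual(lam):
        logZ = np.log(np.sum(np.exp(-lam @ F)) * dx)
        return logZ + lam @ targets

    lam = minimize(dual, np.zeros(F.shape[0])).x
    p = np.exp(-lam @ F)
    return p / (np.sum(p) * dx)             # normalize on the grid

# Constraining mean = 0 and second moment = 1 gives a standard normal
x = np.linspace(-5, 5, 501)
p = maxent_density(x, [x, x**2], np.array([0.0, 1.0]))
print(p[250] * np.sqrt(2 * np.pi))          # ~1.0, i.e. p(0) ~ 1/sqrt(2*pi)
```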


Kinney J.B.,Simons Center for Quantitative Biology
Physical Review E - Statistical, Nonlinear, and Soft Matter Physics | Year: 2014

The question of how best to estimate a continuous probability density from finite data is an intriguing open problem at the interface of statistics and physics. Previous work has argued that this problem can be addressed in a natural way using methods from statistical field theory. Here I describe results that allow this field-theoretic approach to be rapidly and deterministically computed in low dimensions, making it practical for use in day-to-day data analysis. Importantly, this approach does not impose a privileged length scale for smoothness of the inferred probability density, but rather learns a natural length scale from the data due to the tradeoff between goodness of fit and an Occam factor. Open source software implementing this method in one and two dimensions is provided. © 2014 Published by the American Physical Society.
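
A simpler analogue of learning a length scale from data is choosing a kernel bandwidth by leave-one-out likelihood, which embodies the same fit-versus-smoothness tradeoff. The sketch below is a stand-in for, not an implementation of, the paper's field-theoretic estimator.

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian kernel density estimate
    with bandwidth h. Maximizing this over h selects a smoothing length
    scale from the data: tiny h overfits individual points, huge h
    oversmooths, and the likelihood peaks in between."""
    n = len(data)
    diffs = data[:, None] - data[None, :]
    k = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)                # exclude each point from its own estimate
    return np.sum(np.log(k.sum(axis=1) / (n - 1)))

rng = np.random.default_rng(0)
data = rng.normal(size=200)
bandwidths = np.linspace(0.05, 1.0, 20)
best = max(bandwidths, key=lambda h: loo_log_likelihood(data, h))
print(f"selected length scale: {best:.2f}")
```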


Salzberg S.L.,Johns Hopkins University | Phillippy A.M.,Battelle Memorial Institute | Zimin A.,University of Maryland, College Park | Puiu D.,Johns Hopkins University | And 10 more authors.
Genome Research | Year: 2012

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three over-arching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study. © 2012 by Cold Spring Harbor Laboratory Press.
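
Contiguity in such evaluations is usually summarized by the N50 statistic, which, as the study stresses, is not by itself well correlated with correctness. A minimal implementation:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L account for at
    least half of the total assembly span -- the standard contiguity
    statistic for comparing assemblies."""
    lengths = sorted(contig_lengths, reverse=True)
    half, running = sum(lengths) / 2.0, 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print(n50([100, 80, 60, 40, 20]))  # 80, since 100 + 80 >= 150 (half of 300)
```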
