Kelley D.R.,University of Maryland University College |
Schatz M.C.,Simons Center for Quantitative Biology |
Salzberg S.L.,University of Maryland University College
Genome Biology | Year: 2010
We introduce Quake, a program to detect and correct errors in DNA sequencing reads. Using a maximum likelihood approach incorporating quality values and nucleotide-specific miscall rates, Quake achieves the highest accuracy on realistically simulated reads. We further demonstrate substantial improvements in de novo assembly and SNP detection after using Quake. Quake can be used for any size project, including more than one billion human reads, and is freely available as open-source software from http://www.cbcb.umd.edu/software/quake. © 2010 Kelley et al.; licensee BioMed Central Ltd.
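As a rough illustration of how quality values and nucleotide-specific miscall rates can enter a likelihood, the sketch below scores each candidate true base for a single observed base call. It is not Quake's implementation: the `MISCALL` table holds made-up numbers, and the function names are hypothetical. The only standard ingredient is the Phred convention, under which a quality score q corresponds to an error probability of 10^(-q/10).

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


# Hypothetical miscall rates (illustrative numbers, not taken from the paper):
# MISCALL[true_base][observed_base] = P(observed | true, given that an error occurred).
MISCALL = {
    "A": {"C": 0.5, "G": 0.3, "T": 0.2},
    "C": {"A": 0.4, "G": 0.2, "T": 0.4},
    "G": {"A": 0.3, "C": 0.3, "T": 0.4},
    "T": {"A": 0.2, "C": 0.4, "G": 0.4},
}


def base_likelihoods(observed, q):
    """Return P(observed call | true base) for each candidate true base."""
    p_err = phred_to_error_prob(q)
    likelihoods = {}
    for true_base in "ACGT":
        if true_base == observed:
            # The call is correct with probability 1 - p_err.
            likelihoods[true_base] = 1.0 - p_err
        else:
            # The call is an error, weighted by how often this
            # particular substitution occurs for this true base.
            likelihoods[true_base] = p_err * MISCALL[true_base][observed]
    return likelihoods


def most_likely_base(observed, q):
    """Pick the candidate true base with the highest likelihood."""
    scores = base_likelihoods(observed, q)
    return max(scores, key=scores.get)
```

A low quality score raises the likelihood of every alternative base, so a correction procedure built on scores like these would tend to prefer changing low-quality positions first.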
Salzberg S.L.,Johns Hopkins University |
Phillippy A.M.,Battelle |
Zimin A.,University of Maryland University College |
Puiu D.,Johns Hopkins University |
And 10 more authors.
Genome Research | Year: 2012
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three over-arching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study. © 2012 by Cold Spring Harbor Laboratory Press.
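The abstract does not say which contiguity statistics were used. One widely used example is N50: the largest contig length L such that contigs of length at least L together cover at least half of the assembled bases. The minimal sketch below assumes N50 as the statistic and uses a hypothetical function name. It also shows why contiguity says nothing about correctness: the value depends only on contig lengths, not on whether the contigs are assembled correctly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    N50 is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half of
    the total assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Two hypothetical assemblies with the same total size (290 bases):
fragmented = [80, 70, 50, 40, 30, 20]
contiguous = [200, 60, 30]
print(n50(fragmented), n50(contiguous))
```

A misjoined contig is longer than its correct pieces, so assembly errors can even inflate N50.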
Mohammed J.,Cornell University |
Mohammed J.,Tri Institutional Training Program in Computational Biology and Medicine |
Siepel A.,Cornell University |
Siepel A.,Simons Center for Quantitative Biology |
Lai E.C.,Sloan Kettering Institute
RNA | Year: 2014
Many animal miRNA loci reside in genomic clusters that generate multicistronic primary-miRNA transcripts. While clusters that contain copies of the same miRNA hairpin are clearly products of local duplications, the evolutionary provenance of clusters with disparate members is less clear. Recently, it was proposed that essentially all such clusters in Drosophila derived from de novo formation of miRNA-like hairpins within existing miRNA transcripts, and that the maintenance of multiple miRNAs in such clusters was due to evolutionary hitchhiking on a major cluster member. However, this model seems at odds with the fact that many such miRNA clusters are composed of well-conserved miRNAs. In an effort to trace the birth and expansion of miRNA clusters that are presently well-conserved across Drosophilids, we analyzed a broad swath of metazoan species, with particular emphasis on arthropod evolution. Beyond duplication and de novo birth, we highlight a diversity of modes that contribute to miRNA evolution, including neofunctionalization of miRNA copies, fissioning of locally duplicated miRNA clusters, miRNA deletion, and miRNA cluster expansion via the acquisition and/or neofunctionalization of miRNA copies from elsewhere in the genome. In particular, we suggest that miRNA clustering by acquisition represents an expedient strategy to bring cohorts of target genes under coordinate control by miRNAs that had already been individually selected for regulatory impact on the transcriptome. © 2014 Mohammed et al.
Kinney J.B.,Simons Center for Quantitative Biology
Physical Review E - Statistical, Nonlinear, and Soft Matter Physics | Year: 2015
The need to estimate smooth probability distributions (a.k.a. probability densities) from finite sampled data is ubiquitous in science. Many approaches to this problem have been described, but none is yet regarded as providing a definitive solution. Maximum entropy estimation and Bayesian field theory are two such approaches. Both have origins in statistical physics, but the relationship between them has remained unclear. Here I unify these two methods by showing that every maximum entropy density estimate can be recovered in the infinite smoothness limit of an appropriate Bayesian field theory. I also show that Bayesian field theory estimation can be performed without imposing any boundary conditions on candidate densities, and that the infinite smoothness limit of these theories recovers the most common types of maximum entropy estimates. Bayesian field theory thus provides a natural test of the maximum entropy null hypothesis and, furthermore, returns an alternative (lower entropy) density estimate when the maximum entropy hypothesis is falsified. The computations necessary for this approach can be performed rapidly for one-dimensional data, and software for doing this is provided. © 2015 authors. Published by the American Physical Society under the terms of the Creative Commons Attribution 3.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.
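For readers unfamiliar with maximum entropy estimation, the standard textbook form of the estimate is sketched below. This is general background, not a formula quoted from the paper: the constraint functions f_k, their target values \mu_k, and the multipliers \lambda_k are generic notation chosen here for illustration.

```latex
% Generic maximum entropy density under expectation constraints.
% f_k, \mu_k, and \lambda_k are illustrative notation, not the paper's.
\max_{p}\; -\int p(x)\,\ln p(x)\,dx
\quad\text{subject to}\quad
\int p(x)\,dx = 1,
\qquad
\int f_k(x)\,p(x)\,dx = \mu_k,\quad k = 1,\dots,K
\;\;\Longrightarrow\;\;
p(x) = \frac{1}{Z(\lambda)}\exp\!\Big(\sum_{k=1}^{K}\lambda_k f_k(x)\Big),
\qquad
Z(\lambda) = \int \exp\!\Big(\sum_{k=1}^{K}\lambda_k f_k(x)\Big)\,dx .
```

For example, constraining the mean and variance on the real line (f_1(x) = x, f_2(x) = x^2) yields a Gaussian density.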
Schatz M.C.,Simons Center for Quantitative Biology
Genome Research | Year: 2015
The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than 1 billion genomes, bringing even deeper insight into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential for medicine and biology, though, will only be achieved through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. In this perspective, I aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world. © 2015 Schatz.