Smith T.M., Geospiza, Inc.
Advances in Experimental Medicine and Biology | Year: 2010
Next Generation Sequencing technologies are limited by the lack of standard bioinformatics infrastructures that can reduce data storage, increase data processing performance, and integrate diverse information. HDF technologies address these requirements and have a long history of use in data-intensive science communities. They include general data file formats, libraries, and tools for working with the data. Compared to emerging standards, such as the SAM/BAM formats, HDF5-based systems demonstrate significantly better scalability, can support multiple indexes, store multiple data types, and are self-describing. For these reasons, HDF5 and its BioHDF extension are well suited for implementing data models to support the next generation of bioinformatics applications. © 2010 Springer Science+Business Media, LLC.
Agency: Department of Health and Human Services | Branch: | Program: SBIR | Phase: Phase I | Award Amount: 110.00K | Year: 2009
DESCRIPTION (provided by applicant): In November 2008, The Scientist opened an on-line opinion piece with the following quote: After tens of billions of US federal dollars (plus billions more from private sources) and nearly 40 years of aggressive research, the war on cancer is depressingly far from over. Cancer will soon become the leading cause of death in America, passing heart disease. At some point in their lives, 43% of the public will get some form of cancer. While much progress has been made over the years, effective treatments for many forms of cancer are still lacking. Until the many forms of cancer are better understood, treatment options will continue to lag behind. Next generation DNA sequencing (NGS) technologies hold great promise as tools for building a new understanding of cancer and its origins. Deep sequencing provides more sensitive ways to detect the germline and somatic mutations that cause different types of cancer as well as identify new mutations within small subpopulations of tumor cells that can be prognostic indicators of tumor growth or drug resistance. The ultimate goal is to use NGS technologies in the clinic. Before this vision can be realized, many obstacles must be overcome. Assay costs must be significantly lowered and sample throughput must be substantially increased relative to today's capabilities. Achieving this goal will require that we have streamlined procedures for sample preparation and laboratory processes, a complete understanding of NGS systems, error profiles, and assay dynamics, and robust validatable software systems to support diagnostic tests in the clinical enterprise. Geospiza's FinchLab software platform addresses a large number of issues related to operating NGS instruments and laboratory processes in clinical environments. However, our understanding of NGS errors and how to completely characterize NGS datasets, with respect to their potential to deliver high quality information, is incomplete. 
Through the proposed research, Geospiza and collaborators at the Mayo Clinic will remove many of the obstacles that keep this vision of cancer diagnostics from becoming reality. In the Phase I project, we will test the feasibility of developing clinical systems by characterizing a limited number of NGS datasets for true variants and for false-positive and false-negative errors. We will do this by cataloging discrepant bases relative to control sequences, with respect to sequence contexts, random noise, laboratory steps, and instrument artifacts. The catalogs will then be used to develop statistical algorithms that can analyze large numbers of aligned reads and assign variant detection probabilities to individual bases, as well as calculate summary statistics that can be used to assign descriptive values to datasets from individual samples and subsequently identify sample artifacts and issues related to sample processing. Geospiza will combine the insights gained, and new software tools developed, into the FinchLab system to give researchers better ways to work with NGS data and more clear-cut methods for visualizing genetic assay results presented in web-based interfaces. In addition, Geospiza will promote community involvement by making many of the core algorithms available through BioConductor. PUBLIC HEALTH RELEVANCE: The SBIR project Software Systems for Detecting Rare Mutations will deliver new software technologies to further advance the applications for deep DNA sequencing in personalized medicine by improving methods for detecting rare mutations that define cancer types and determine how a cancer cell may grow and respond to, or resist, treatment. In addition to improving cancer research and diagnostics, the software developed will have general use for any application where DNA sequencing is used to understand the genetic basis of human health, disease, and response to drug therapies.
Agency: Department of Health and Human Services | Branch: | Program: STTR | Phase: Phase II | Award Amount: 1.17M | Year: 2009
DESCRIPTION (provided by applicant): The first wave of Next Generation (Next Gen) sequencing technologies combines molecular resolution with extremely high throughput to dramatically reduce sequencing costs and increase assay sensitivity and specificity. These technologies will provide large numbers of laboratories with Genome Center levels of throughput to make discoveries and develop new assays never before imagined. However, widespread adoption of Next Gen will be hindered because current bioinformatics programs do not scale; they are inefficient in data storage, processing, and memory utilization. The most popular programs typically copy and recopy data to new files many times during processing, require that all data be maintained in random access memory (RAM) when running, and cannot incrementally process data. To overcome these issues, fundamental changes in data management and processing are needed. Geospiza and The HDF Group are collaborating to develop portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format, http://www.hdfgroup.org). We call these extensible domain-specific data technologies BioHDF. BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, and metadata) and results from sequence assembly and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, and graph layouts) to support the high performance data storage and computation requirements of Next Gen Sequencing. BioHDF will include APIs, software tools, and a viewer based on HDFView to enable its use in the bioinformatics and research communities.
Using BioHDF, researchers will be able to perform whole genome shotgun sequencing (WGS), tag and count experiments (EST analysis, promoter mapping, DNA methylation, functional mapping), and variation analysis; they will also be able to export datasets in formats accepted by the key databases to publish their work. As a programming environment, BioHDF can be easily extended to accept data from new data collection platforms, and format data for interchange with many databases. Core BioHDF tools will be delivered to the research community as an open source technology. Geospiza will use BioHDF in its Finch line of products to deliver software systems and applications to support clinical research, diagnostics, and other relevant activities that rely on genetic data. PUBLIC HEALTH RELEVANCE: The overall goal of the BioHDF Phase II project is to make it possible for medical research and clinical communities to take full advantage of the latest DNA sequencing platforms in their efforts to improve public health. Geospiza and The HDF Group will build on their expertise in Laboratory Information Management Systems and high-volume, high-complexity scientific data management systems to create and deliver bioinformatics software systems that can handle the massive amounts of data produced by the latest sequencing instruments. The integrated systems will keep track of collected samples, sequence data, DNA tests, and other laboratory records and biological data associated with the entire sequencing and analysis process, and make it easy for clinicians to use the technology to do their work.
Agency: Department of Health and Human Services | Branch: | Program: STTR | Phase: Phase I | Award Amount: 142.78K | Year: 2005
DESCRIPTION (provided by applicant): Geospiza Inc. and the National Center for Supercomputing Applications (NCSA) are creating a standards-based software framework around NCSA's Hierarchical Data Format (HDF5). The envisioned framework will integrate algorithms important in DNA and protein sequence analysis to create scalable, high-throughput software systems. These systems will be accessed through new graphical user interfaces (GUIs) that give researchers new views of their data to finish sequencing projects in large-scale genome sequencing, microbial genome sequencing, viral epidemiology, polymorphism detection, phylogenetic analysis, multi-locus sequence typing, confirmatory sequencing, and EST analysis. In our vision, algorithms will be either integrated into the system to directly read and write from HDF5 project files, or they will communicate with project files via filter programs that produce standardized XML formatted data. Through this model, a scalable solution will support different applications of DNA sequencing, fulfilling the many needs and requirements expressed by the medical research community now and into the future. As the first step in this process, we will define requirements for editing and versioning data in DNA sequencing, research and propose data models for the computational phases of DNA sequencing and annotating DNA sequence data using existing standards, create a prototype application for DNA sequencing based SNP discovery, and engage the bioinformatics community for BioHDF adoption. In the past ten years the cost of sequencing DNA has dropped over 1000-fold and the amount of raw sequence data entering our national repositories is doubling every 12 months. DNA sequencing is fundamental to biological research activities such as genomics, systems biology, and clinical medicine.
Proposals are being sought to decrease sequencing costs by two orders of magnitude through technology refinements with an ultimate vision of developing technology to sequence human genome equivalents for $1,000 each. The amount of data that will be produced through these endeavors is unimaginable. However, the $1,000 genome will not advance medical research unless we integrate all phases of the DNA sequencing process and treat the creation, management, finishing, analysis, and sharing of the data as common goals.
Agency: Department of Health and Human Services | Branch: | Program: SBIR | Phase: Phase II | Award Amount: 1.18M | Year: 2011
DESCRIPTION (provided by applicant): Next generation DNA sequencing (NGS) technologies hold great promise as tools for building a new understanding of health and disease. In the case of understanding cancer, deep sequencing provides more sensitive ways to detect the germline and somatic mutations that cause different types of cancer as well as identify new mutations within small subpopulations of tumor cells that can be prognostic indicators of tumor growth or drug resistance. Completing the transition from proof-of-principle applications to practical applications, however, requires that many basic and clinical research groups be able to use NGS effectively. Ongoing technical developments and intense vendor competition amongst NGS platform and service providers are commoditizing data collection costs, making systems more accessible. However, the single greatest impediment to the adoption of NGS technology is the lack of systems that create easy access to the immense bioinformatics and IT infrastructures needed to work with the data. In the case of variant analysis, such systems will need to process very large datasets, and accurately predict common, rare, and de novo levels of variation. Genetic variation must be presented in an annotation-rich, biological context to determine the clinical utility, frequency, and putative biological impact. Software systems used for this work must integrate data from many samples together with resources ranging from core analysis algorithms to application specific datasets to annotations, all woven into computational systems with interactive user interfaces (UIs). Such end-to-end systems currently do not exist. In this project, Geospiza will create integrated methods for robust detection and rich contextualization of genetic variants.
Using variation analysis in cancer genomics as a model system, we will conduct research to improve assay sensitivity by deeply characterizing data from existing and emerging NGS platforms, quality value (QV) recalibration tools, and alignment algorithms, to understand the systematic artifacts that create errors in the data. To improve how researchers understand a variant's biological context, function, and potential clinical utility, we will develop methods to combine assay results from many samples with de novo NGS datasets for assays like RNA-Seq and existing data such as those in GEO and SRA, and information resources from dbSNP, cancer genome databases, and ENCODE. Finally, we will develop the necessary scalable computing infrastructure and novel UIs needed to organize and process the data and explore and annotate the results. Through this work, and follow-on product development, we will produce integrated, sensitive assay systems that harness NGS for identifying very low (1:1000) levels of changes between DNA sequences to detect cancerous mutations and emerging drug resistance. Our tools and infrastructure can later be applied in assays designed to follow viral epidemics and understand autoimmune disorders. PUBLIC HEALTH RELEVANCE: The SBIR project Software Systems for Detecting Rare Mutations will deliver new software technologies to further advance the applications for deep DNA sequencing in personalized medicine by improving methods for detecting rare mutations that define cancer types and determine how a cancer cell may grow and respond to, or resist, treatment. In addition to improving cancer research and diagnostics, the software developed will have general use for any application where DNA sequencing is used to understand the genetic basis of human health, disease, and response to drug therapies.