Colorado, Colorado, United States

Gonzalez G.,Arizona State University | Cohen K.B.,Computational Bioscience Program | Kann M.G.,University of Maryland Baltimore County | Leaman R.,U.S. National Center for Biotechnology Information | And 3 more authors.
18th Pacific Symposium on Biocomputing, PSB 2013 | Year: 2013

The biggest challenge for text and data mining is to truly impact the biomedical discovery process, enabling scientists to generate novel hypotheses that address the most crucial questions. Among a number of worthy submissions, we have selected six papers that exemplify advances in text and data mining methods with a demonstrated impact on a wide range of applications. Work presented in this session includes data mining techniques applied to the discovery of 3-way genetic interactions and to the analysis of genetic data in the context of electronic medical records (EMRs), as well as an integrative approach that combines data from genetic (SNP) and transcriptomic (microarray) sources for clinical prediction. Text mining advances include a classification method to determine whether a published article contains pharmacological experiments relevant to drug-drug interactions, a fine-grained text mining approach for detecting the catalytic sites of proteins in the biomedical literature, and a method for automatically extending a taxonomy of health-related terms to integrate consumer-friendly synonyms for medical terminologies.

Temnikova I.P.,Bulgarian Academy of Science | Hailu N.D.,Computational Bioscience Program | Angelova G.,Bulgarian Academy of Science | Cohen K.B.,Computational Bioscience Program
International Conference Recent Advances in Natural Language Processing, RANLP | Year: 2013

Patent search is an important information retrieval problem in scientific and business research. Semantic search would be a large improvement to current technologies, but requires some insight into the language of patents. In this article we test the fit of the language of patents to the sublanguage model, focussing on closure properties. The research presented here is relevant to the topic of sublanguage identification for different domains, and to the study of the language of patents. We investigate the hypothesis that fit to the sublanguage model increases as one moves down the International Patent Classification hierarchy. The analysis employs a general English corpus and patent documents from the MAREC corpus. It is shown that patents generally fit the sublanguage model, with some variability between categories in the extent of the fit.

Temnikova I.P.,Bulgarian Academy of Science | Nikolova I.,Bulgarian Academy of Science | Baumgartner Jr. W.A.,Computational Bioscience Program | Angelova G.,Bulgarian Academy of Science | Cohen K.B.,Computational Bioscience Program
International Conference Recent Advances in Natural Language Processing, RANLP | Year: 2013

Sublanguages are specialized genres of language associated with specific domains and document types. When sublanguages can be recognized and adequately characterized, they are useful for a variety of natural language processing applications. Although there are sublanguage studies related to languages other than English, all previous work on sublanguage recognition has focused on sublanguages related to general English. This paper tests whether a sublanguage-detection technique developed for English can be applied to another language. Bulgarian clinical documents are an excellent test case, because of a number of unique linguistic properties that affect their lexical and morphological characteristics. Bulgarian clinical documents were studied with respect to their closure properties and were found to fit the sublanguage model and exhibit characteristics like those noted for sublanguages related to English. It was also confirmed that the clinical sublanguage phenomenon is not specific to English, but applies to other languages as well. Implications of this fact for natural language processing are proposed.
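
As an illustration of what a closure property can look like in practice, the sketch below tracks how many distinct word types have been seen as more tokens of a corpus are read. It assumes lexical closure is measured as vocabulary growth, where a curve that flattens suggests a closed vocabulary; the abstract does not specify the exact procedure, and the function name `type_growth_curve` and the `step` parameter are hypothetical.

```python
def type_growth_curve(tokens, step=1000):
    """Return (tokens_seen, distinct_types) pairs sampled every `step` tokens.

    Assumption: closure is approximated by vocabulary growth. A curve that
    levels off suggests a closed (sublanguage-like) vocabulary, while a curve
    that keeps climbing suggests open, general language.
    """
    tokens = list(tokens)
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token.lower())  # case-insensitive types (a simplification)
        # Record a sample point at every `step` tokens and at the very end.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


if __name__ == "__main__":
    toy_corpus = "the patient was admitted the patient was discharged".split()
    for tokens_seen, types_seen in type_growth_curve(toy_corpus, step=2):
        print(tokens_seen, types_seen)
```

Comparing such curves for a clinical corpus and a general-language corpus of the same size is one simple way to see whether the clinical text tends toward closure.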

Bretonnel Cohen K.,Computational Bioscience Program | Bretonnel Cohen K.,University of Colorado at Boulder | Christiansen T.,Computational Bioscience Program | Hunter L.E.,Computational Bioscience Program
NIST Special Publication | Year: 2011

The goal of this work was to establish a reasonable baseline for research in patient cohort retrieval from clinical free text. Much recent work has used Lucene for this purpose. Our approach was to use MetaMap alone. We found that although many TREC 2011 Electronic Medical Records track participants found it difficult to beat a Lucene baseline, our MetaMap-based baseline did outperform a number of Lucene runs. We propose that MetaMap is a more valid baseline than Lucene, providing essential concept extraction, and that failure to make use of this industry-standard tool results in an unfairly low baseline for evaluation of system outputs.

Verspoor K.,Computational Bioscience Program | Cohen K.B.,Computational Bioscience Program | Cohen K.B.,University of Colorado at Boulder | Lanfranchi A.,University of Colorado at Boulder | And 12 more authors.
BMC Bioinformatics | Year: 2012

Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications. © 2012 Verspoor et al.; licensee BioMed Central Ltd.

PubMed | Computational Bioscience Program
Journal: BMC Bioinformatics | Year: 2014

Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem. Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented. Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14 to 0.83). Of the three systems tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure on seven out of eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters, F-measure can be increased by up to 0.4. In addition to the best-performing parameters, suggestions for choosing parameters based on ontology characteristics are presented.
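
To make the evaluation described above concrete, the sketch below computes the balanced F-measure (the harmonic mean of precision and recall) over sets of annotations and picks the parameter combination with the highest score. It assumes exact-match scoring of (start, end, concept ID) tuples, which the abstract does not state; the names `f_measure` and `best_parameters`, the parameter labels, and the example annotations are all hypothetical.

```python
def f_measure(gold, predicted):
    """Balanced F-measure over two collections of annotations.

    Assumption: an annotation counts as correct only if it exactly matches
    a gold annotation, e.g. the same (start, end, concept_id) tuple.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_parameters(gold, runs):
    """Return the parameter setting whose output scores the highest F-measure.

    `runs` maps a (hypothetical) parameter description to the annotations
    a system produced under that setting.
    """
    return max(runs, key=lambda params: f_measure(gold, runs[params]))


if __name__ == "__main__":
    # Made-up annotations for illustration only.
    gold = {(0, 4, "GO:1"), (10, 15, "GO:2")}
    runs = {
        "default": {(0, 4, "GO:1"), (20, 25, "GO:3")},
        "stemming=on": {(0, 4, "GO:1"), (10, 15, "GO:2")},
    }
    for params, predicted in runs.items():
        print(params, f_measure(gold, predicted))
    print("best:", best_parameters(gold, runs))
```

Running the same comparison for each system-ontology pair is one way to see how far a tuned configuration can move from the default.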
