Gallagher M.E.,Lister Hill National Center for Biomedical Communications
Archiving 2013 - Final Program and Proceedings | Year: 2013
The Profiles in Science® digital library features digitized surrogates of historical items selected from the archival collections of the U.S. National Library of Medicine as well as collaborating institutions. In addition, it contains a database of descriptive, technical and administrative metadata. It also contains various software components that allow creation of the metadata, management of the digital items, and access to the items and metadata through the Profiles in Science Web site. The choices made building the digital library were designed to maximize the sustainability and long-term survival of all of the components of the digital library. For example, selecting standard and open digital file formats rather than proprietary formats increases the sustainability of the digital files. Correspondingly, using non-proprietary software may improve the sustainability of the software - either through in-house expertise or through the open source community. Limiting our digital library software exclusively to open source software or to software developed in-house has not been feasible. For example, we have used proprietary operating systems, scanning software, a search engine, and office productivity software. We did this when either lack of essential capabilities or the cost-benefit trade-off favored using proprietary software. We also did so knowing that in the future we would need to replace or upgrade some of our proprietary software, analogous to migrating from an obsolete digital file format to a new format as the technological landscape changes. Since our digital library's start in 1998, all of its software has been upgraded or replaced, but the digitized items have not yet required migration to other formats. Technological changes that compelled us to replace proprietary software included the cost of product licensing, product support, incompatibility with other software, prohibited use due to evolving security policies, and product abandonment.
Sometimes these changes happen on short notice, so we continually monitor our library's software for signs of endangerment. We have attempted to replace proprietary software with suitable in-house or open source software. When the replacement involves a standalone piece of software with a nearly equivalent version, such as replacing a commercial HTTP server with an open source HTTP server, the replacement is straightforward. Recently we replaced software that functioned not only as our search engine but also as the backbone of the architecture of our Web site. In this paper, we describe the lessons learned and the pros and cons of replacing this software with open source software. © Copyright 2013; Society for Imaging Science and Technology.
Kastrin A.,College of Technological Studies |
Rindflesch T.C.,Lister Hill National Center for Biomedical Communications |
Hristovski D.,University of Ljubljana
PLoS ONE | Year: 2014
Concept associations can be represented by a network that consists of a set of nodes representing concepts and a set of edges representing their relationships. Complex networks exhibit some common topological features including small diameter, high degree of clustering, power-law degree distribution, and modularity. We investigated the topological properties of a network constructed from co-occurrences between MeSH descriptors in the MEDLINE database. We conducted the analysis on two networks, one constructed from all MeSH descriptors and another using only major descriptors. Network reduction was performed using Pearson's chi-square test for independence. To characterize the topological properties of the network we adopted some specific measures, including diameter, average path length, clustering coefficient, and degree distribution. For the full MeSH network the average path length was 1.95 edges with a diameter of three edges and a clustering coefficient of 0.26. The Kolmogorov-Smirnov test rejected the power law as a plausible model for the degree distribution. For the major MeSH network the average path length was 2.63 edges with a diameter of seven edges and a clustering coefficient of 0.15. The Kolmogorov-Smirnov test failed to reject the power law as a plausible model. The power-law exponent was 5.07. In both networks it was evident that nodes with a lower degree exhibit higher clustering than those with a higher degree. After a simulated attack, in which we removed the 10% of nodes with the highest degrees, the giant component of each of the two networks contained about 90% of all nodes. Because of its small average path length and high degree of clustering, the MeSH network is small-world. A power-law distribution is not a plausible model for the degree distribution. The network is highly modular, is highly resistant to targeted and random attack, and shows minimal disassortativity.
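The topological measures named in this abstract can be illustrated on a toy graph. The following is a minimal sketch in standard-library Python, not the authors' pipeline: the five-node graph, the function names, and the choice to report the giant component as a fraction of the original node count are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_and_diameter(adj):
    """Average shortest-path length over connected pairs, and the diameter."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter

def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

def giant_fraction_after_attack(adj, fraction):
    """Remove the top `fraction` of nodes by degree, then return the size of
    the largest component as a share of the ORIGINAL node count (assumption)."""
    n_remove = int(len(adj) * fraction)
    by_degree = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    removed = set(by_degree[:n_remove])
    remaining = {u: adj[u] - removed for u in adj if u not in removed}
    seen, largest = set(), 0
    for u in remaining:
        if u not in seen:
            component = bfs_distances(remaining, u)
            seen.update(component)
            largest = max(largest, len(component))
    return largest / len(adj)

# Hypothetical toy graph: a triangle (a, b, c) with a two-edge tail (c-d-e).
toy = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])
apl, diameter = path_length_and_diameter(toy)
clustering = average_clustering(toy)
attacked = giant_fraction_after_attack(toy, 0.2)
```

On a graph this small, removing the single highest-degree node splits the network, which is the opposite of the robustness reported for the MeSH networks; the sketch only shows how such a measurement could be set up.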
Bekhuis T.,University of Pittsburgh |
Demner-Fushman D.,Lister Hill National Center for Biomedical Communications
Studies in Health Technology and Informatics | Year: 2010
Systematic review authors synthesize research to guide clinicians in their practice of evidence-based medicine. Teammates independently identify provisionally eligible studies by reading the same set of hundreds and sometimes thousands of citations during an initial screening phase. We investigated whether supervised machine learning methods can potentially reduce their workload. We also extended earlier research by including observational studies of a rare condition. To build training and test sets, we used annotated citations from a search conducted for an in-progress Cochrane systematic review. We extracted features from titles, abstracts, and metadata, then trained, optimized, and tested several classifiers with respect to mean performance based on 10-fold cross-validations. In the training condition, the evolutionary support vector machine (EvoSVM) with an Epanechnikov or radial kernel is the best classifier: mean recall=100%; mean precision=48% and 41%, respectively. In the test condition, EvoSVM performance degrades: mean recall=77%, mean precision ranges from 26% to 37%. Because near-perfect recall is essential in this context, we conclude that supervised machine learning methods may be useful for reducing workload under certain conditions. © 2010 IMIA and SAHIA. All rights reserved.
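The recall and precision figures reported here are averages over cross-validation folds. As a minimal sketch of how such means can be computed, assuming binary labels (1 = provisionally eligible citation) and invented toy folds rather than the study's data:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = eligible, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_precision_recall(folds):
    """Average precision and recall over (y_true, y_pred) pairs, one per fold."""
    scores = [precision_recall(y_true, y_pred) for y_true, y_pred in folds]
    mean_p = sum(p for p, _ in scores) / len(scores)
    mean_r = sum(r for _, r in scores) / len(scores)
    return mean_p, mean_r

# Hypothetical folds: the classifier finds every eligible citation (recall 1.0)
# at the cost of false positives (precision 0.5).
toy_folds = [
    ([1, 1, 0, 0, 0], [1, 1, 1, 1, 0]),
    ([1, 0, 0, 0], [1, 1, 0, 0]),
]
mean_p, mean_r = mean_precision_recall(toy_folds)
```

The toy folds mirror the trade-off described in the abstract: a screening classifier tuned for near-perfect recall will typically accept many ineligible citations, so precision stays well below recall.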
Simpson M.S.,Lister Hill National Center for Biomedical Communications
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium | Year: 2012
Image content is frequently the target of biomedical information extraction systems. However, the meaning of this content cannot be easily understood without some associated text. In order to improve the integration of textual and visual information, we are developing a visual ontology for biomedical image retrieval. Our visual ontology maps the appearance of image regions to concepts in an existing textual ontology, thereby inheriting relationships among the visual entities. Such a resource creates a bridge between the visual characteristics of important image regions and their semantic interpretation. We automatically populate our visual ontology by pairing image regions with their associated descriptions. To demonstrate the usefulness of this resource, we have developed a classification method that automatically labels image regions with appropriate concepts based solely on their appearance. Our results for thoracic imaging terms show that our methods are promising first steps towards the creation of a biomedical visual ontology.
Bekhuis T.,University of Pittsburgh |
Demner-Fushman D.,Lister Hill National Center for Biomedical Communications |
Crowley R.,University of Pittsburgh
Journal of the Medical Library Association | Year: 2013
Objectives: We analyzed the extent to which comparative effectiveness research (CER) organizations share terms for designs, analyzed coverage of CER designs in Medical Subject Headings (MeSH) and Emtree, and explored whether scientists use CER design terms. Methods: We developed local terminologies (LTs) and a CER design terminology by extracting terms in documents from five organizations. We defined coverage as the distribution over match type in MeSH and Emtree. We created a crosswalk by recording terms to which design terms mapped in both controlled vocabularies. We analyzed the hits for queries restricted to titles and abstracts to explore scientists' language. Results: Pairwise LT overlap ranged from 22.64% (12/53) to 75.61% (31/41). The CER design terminology (n=78 terms) consisted of terms for primary study designs and a few terms useful for evaluating evidence, such as opinion paper and systematic review. Patterns of coverage were similar in MeSH and Emtree (gamma=0.581, P=0.002). Conclusions: Stakeholder terminologies vary, and terms are inconsistently covered in MeSH and Emtree. The CER design terminology and crosswalk may be useful for expert searchers. For partially mapped terms, queries could consist of free text for modifiers such as nonrandomized or interrupted added to broad or related controlled terms.
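The pairwise overlap percentages can be checked arithmetically, and the overlap of two term sets can be sketched in a few lines. The abstract does not state which denominator was used, so the sketch below assumes shared terms divided by the union of the two terminologies; the example term sets and function names are hypothetical.

```python
def overlap(terms_a, terms_b):
    """Shared terms as a fraction of all distinct terms in either set.
    The union denominator is an assumption, not taken from the abstract."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def as_percent(numerator, denominator):
    """Express a ratio as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)

# Hypothetical local terminologies from two organizations.
lt_one = {"cohort study", "case-control study", "randomized controlled trial"}
lt_two = {"cohort study", "randomized controlled trial", "case series",
          "cross-sectional study"}
toy_overlap = overlap(lt_one, lt_two)  # 2 shared terms out of 5 distinct terms

# Arithmetic check of the ratios reported in the abstract.
low = as_percent(12, 53)
high = as_percent(31, 41)
```

Under this assumption, the reported range corresponds to 12 shared terms among 53 distinct terms at the low end and 31 among 41 at the high end.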