The Data Detektiv

Waltham, MA, United States


Deans A.R.,Pennsylvania State University | Lewis S.E.,Lawrence Berkeley National Laboratory | Huala E.,Carnegie Institution for Science | Anzaldo S.S.,Arizona State University | And 70 more authors.
PLoS Biology | Year: 2015

Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility. © 2015 Deans et al.

Thessen A.E.,The Data Detektiv | Thessen A.E.,Ronin Institute for Independent Scholarship | Fertig B.,Ronin Institute for Independent Scholarship | Fertig B.,Versar Inc. | And 2 more authors.
Estuaries and Coasts | Year: 2016

Holistic understanding of estuarine and coastal environments across interacting domains with high-dimensional complexity can profitably be approached through data-centric synthesis studies. Synthesis has been defined as “the inferential process whereby new models are developed from analysis of multiple data sets to explain observed patterns across a range of time and space scales.” Examples include ecological—across ecosystem components or organization levels, spatial—across spatial scales or multiple ecosystems, and temporal—across temporal scales. Though data quantity and volume are increasingly accessible, infrastructures for data sharing, management, and integration remain fractured. Integrating heterogeneous data sets is difficult yet critical. Technological and cultural obstacles hamper finding, accessing, and integrating data to answer scientific and policy questions. To investigate synthesis within the estuarine and coastal science community, we held a workshop at a coastal and estuarine research federation conference and conducted two case studies involving synthesis science. The workshop indicated that data-centric synthesis approaches are valuable for (1) hypothesis testing, (2) baseline monitoring, (3) historical perspectives, and (4) forecasting. Case studies revealed important weaknesses in current data infrastructures and highlighted opportunities for ecological synthesis science. Here, we list requirements for a coastal and estuarine data infrastructure. We model data needs and suggest directions for moving forward. For example, we propose developing community standards, accommodating and integrating big and small data (e.g., sensor feeds and single data sets), and digitizing ‘dark data’ (inaccessible, non-curated, non-archived data potentially destroyed when researchers leave science). © 2015, Coastal and Estuarine Research Federation.

Thessen A.E.,The Data Detektiv | Thessen A.E.,University of Cambridge | Thessen A.E.,The Ronin Institute for Independent Scholarship | McGinnis S.,University of Cambridge | North E.W.,University of Cambridge
Computers and Geosciences | Year: 2016

Process studies and coupled-model validation efforts in geosciences often require integration of multiple data types across time and space. For example, improved prediction of hydrocarbon fate and transport is an important societal need which fundamentally relies upon synthesis of oceanography and hydrocarbon chemistry. Yet, there are no publicly accessible databases which integrate these diverse data types in a georeferenced format, nor are there guidelines for developing such a database. The objective of this research was to analyze the process of building one such database to provide baseline information on data sources and data sharing and to document the challenges and solutions that arose during this major undertaking. The resulting Deepwater Horizon Database was approximately 2.4 GB in size and contained over 8 million georeferenced data points collected from industry, government databases, volunteer networks, and individual researchers. The major technical challenge that was overcome was the reconciliation of terms, units, and quality flags, which was necessary to effectively integrate the disparate data sets. Assembling this database required the development of relationships with individual researchers and data managers which often involved extensive e-mail contacts. The average number of emails exchanged per data set was 7.8. Of the 95 relevant data sets that were discovered, 38 (40%) were obtained, either in whole or in part. Over one third (36%) of the requests for data went unanswered. The majority of responses were received after the first request (64%) and within the first week of the first request (67%). Although fewer than half of the potentially relevant datasets were incorporated into the database, the level of sharing (40%) was high compared to some other disciplines where sharing can be as low as 10%.
Our suggestions for building integrated databases include budgeting significant time for e-mail exchanges, being cognizant of the cost versus benefits of pursuing reticent data providers, and building trust through clear, respectful communication and with flexible and appropriate attributions. © 2015 Elsevier Ltd.
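The kind of reconciliation the abstract describes (mapping synonymous variable names, converting units, and normalizing provider-specific quality flags before merging georeferenced records) can be sketched as follows. This is a minimal illustrative sketch, not the actual Deepwater Horizon Database schema; the field names, unit codes, and flag vocabularies here are all assumptions.

```python
# Hypothetical reconciliation tables; the real database's terms, units,
# and flag schemes are not given in the abstract.

# Synonymous field names from different providers -> one canonical term.
TERM_MAP = {
    "temp": "temperature",
    "water_temp": "temperature",
    "sal": "salinity",
}

# Per (term, source unit): a conversion to one target unit per term.
UNIT_CONVERSIONS = {
    ("temperature", "degF"): lambda v: (v - 32.0) * 5.0 / 9.0,  # -> degC
    ("temperature", "degC"): lambda v: v,
    ("salinity", "ppt"): lambda v: v,
}

# Provider-specific quality flags -> a shared vocabulary.
FLAG_MAP = {"1": "good", "good": "good", "Q": "suspect", "9": "bad"}

def reconcile(record):
    """Return a record with canonical term, unit, and quality flag."""
    term = TERM_MAP.get(record["name"], record["name"])
    convert = UNIT_CONVERSIONS.get((term, record["unit"]), lambda v: v)
    return {
        "term": term,
        "value": convert(record["value"]),
        "flag": FLAG_MAP.get(str(record["flag"]), "unknown"),
        "lat": record["lat"],
        "lon": record["lon"],
    }

row = {"name": "water_temp", "unit": "degF", "value": 77.0,
       "flag": "1", "lat": 28.74, "lon": -88.37}
print(reconcile(row))  # 77.0 degF becomes 25.0 degC, flag "1" becomes "good"
```

In practice each new data provider adds entries to these tables, which is one reason the per-dataset e-mail overhead the authors report was so significant.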

Quintero E.,Subcoordinacion de Especies Prioritarias | Thessen A.E.,The Data Detektiv | Thessen A.E.,The Ronin Institute for Independent Scholarship | Arias-Caballero P.,Subcoordinacion de Especies Prioritarias | Ayala-Orozco B.,Subcoordinacion de Especies Prioritarias
PeerJ | Year: 2014

Background. Mexico has the world's fifth largest population of amphibians and the second highest number of threatened amphibian species. About 10% of Mexican amphibians lack enough data to be assigned to a risk category by the IUCN, so in this paper we test a statistical tool that, in the absence of specific demographic data, can assess a species' risk of extinction and population trend and identify which variables increase its vulnerability. Recent studies have demonstrated that the risk of species decline depends on extrinsic and intrinsic traits, so including both when assessing extinction risk might render a more accurate assessment of threats. Methods. We harvested data from the Encyclopedia of Life (EOL) and the published literature for Mexican amphibians, and used these data to assess the population trend of some of the Mexican species that have been assigned to the Data Deficient category of the IUCN using Random Forests, a Machine Learning method that gives a prediction of complex processes and identifies the most important variables that account for the predictions. Results. Our results show that most of the data deficient Mexican amphibians that we used have decreasing population trends. We found that Random Forests is a solid way to identify species with decreasing population trends when no demographic data are available. Moreover, we point to the most important variables that make species more vulnerable to extinction. This exercise is a very valuable first step in assigning conservation priorities for poorly known species. © 2014 Quintero et al.
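The workflow the abstract describes (fit a Random Forest to species traits, predict a categorical population trend, then rank variables by importance) can be sketched with scikit-learn. The trait matrix below is synthetic and the trait names are hypothetical; the authors' actual EOL-derived features are not listed in the abstract.

```python
# Illustrative sketch, not the authors' code: a Random Forest predicts a
# binary "declining" label from species traits and reports which traits
# matter most. Trait names and the synthetic data are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
traits = ["body_size", "clutch_size", "range_area", "elevation"]
X = rng.normal(size=(300, len(traits)))
# Assume, for the sketch, that small range area drives decline.
y = (X[:, 2] < 0).astype(int)  # 1 = decreasing population trend

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Variable importances point to the traits that make species vulnerable.
for name, imp in zip(traits, model.feature_importances_):
    print(f"{name}: {imp:.3f}")

# Predict the trend of a species that lacks demographic data.
new_species = np.array([[0.1, -0.3, -1.2, 0.5]])  # small range_area
print("declining" if model.predict(new_species)[0] == 1 else "stable")
```

The same importance ranking is what lets the authors say which variables most increase vulnerability, even for Data Deficient species with no demographic time series.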
