Database Center for Life Science

Kashiwa, Japan

Yamamoto Y.,Database Center for Life Science | Yamaguchi A.,Database Center for Life Science | Splendiani A.,A BioHackathon Participant
CEUR Workshop Proceedings | Year: 2016

A consequence of the increasing amount of information available in RDF is that it is getting harder for users to find which sources (often SPARQL endpoints) are the most appropriate, reliable and up to date for the information they seek. Here we introduce YummyData, a service that monitors and assesses the "quality" of endpoints providing datasets of interest to the biomedical community. It helps biomedical researchers in two ways: by providing a curated list of endpoints and by enriching it with information on their availability, update rate, standards compliance, and other features that are relevant to users. Since we believe this assessment is valuable for both consumers (such as researchers) and providers of biomedical RDF data, YummyData provides a forum where they can communicate and improve the usability of the web of (bio) data.

Kawano S.,Database Center for Life Science | Ono H.,Database Center for Life Science | Takagi T.,University of Tokyo | Bono H.,Database Center for Life Science
Briefings in Bioinformatics | Year: 2012

In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the field of life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but such static content cannot adequately describe recent dynamic web services. Due to improvements in computer technology, it is now possible to create dynamic content such as video with minimal effort and at low cost on most modern computers. The ease of creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online video repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV and discusses, with examples, the production and distribution of high-quality tutorial videos that are useful to viewers. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a tool to enhance product usability. © The Author(s) 2011. Published by Oxford University Press.

Kodama Y.,National Institute of Genetics | Mashima J.,National Institute of Genetics | Kosuge T.,National Institute of Genetics | Katayama T.,Database Center for Life Science | And 7 more authors.
Nucleic Acids Research | Year: 2015

The DNA Data Bank of Japan Center (DDBJ Center) maintains and provides public archival, retrieval and analytical services for biological information. Since October 2013, DDBJ Center has operated the Japanese Genotype-phenotype Archive (JGA) in collaboration with our partner institute, the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency. DDBJ Center provides the JGA database system, which securely stores genotype and phenotype data collected from individuals whose consent agreements authorize data release only for specific research use. NBDC has established guidelines and policies for sharing human-derived data and reviews data submission and usage requests from researchers. In addition to the JGA project, DDBJ Center develops Semantic Web technologies for data integration and sharing in collaboration with the Database Center for Life Science. This paper gives an overview of the JGA project, updates to the DDBJ databases, and services for data retrieval, analysis and integration. © The Author(s) 2014.

Naito Y.,Database Center for Life Science | Hino K.,University of Tokyo | Bono H.,Database Center for Life Science | Bono H.,National Institute of Genetics | Ui-Tei K.,University of Tokyo
Bioinformatics | Year: 2014

CRISPRdirect is a simple and functional web server for selecting rational CRISPR/Cas targets from an input sequence. The CRISPR/Cas system is a promising technique for genome engineering that allows target-specific cleavage of genomic DNA guided by Cas9 nuclease in complex with a guide RNA (gRNA) that complementarily binds to a ∼20 nt targeted sequence. The target sequence requirements are twofold. First, the 5′-NGG protospacer adjacent motif (PAM) sequence must be located adjacent to the target sequence. Second, the target sequence should be specific within the entire genome in order to avoid off-target editing. CRISPRdirect enables users to easily select rational target sequences with minimized off-target sites by performing exhaustive searches against genomic sequences. The server currently incorporates the genomic sequences of human, mouse, rat, marmoset, pig, chicken, frog, zebrafish, Ciona, fruit fly, silkworm, Caenorhabditis elegans, Arabidopsis, rice, Sorghum and budding yeast. © The Author 2014.
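
The two target requirements described above can be illustrated with a minimal Python sketch. This is not CRISPRdirect's implementation: the function names `find_ngg_targets` and `count_occurrences` are hypothetical, the sketch assumes a 20 nt target with the NGG PAM immediately on its 3′ side, it scans the forward strand only, and it counts exact matches rather than performing the exhaustive off-target search the server carries out.

```python
import re


def find_ngg_targets(seq, target_len=20):
    """Return (start, target, pam) for each forward-strand site followed by NGG.

    Assumption: the PAM sits immediately 3' of a target of length target_len.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % target_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


def count_occurrences(genome, target):
    """Count exact, possibly overlapping, forward-strand occurrences of target."""
    genome = genome.upper()
    count = 0
    pos = genome.find(target)
    while pos != -1:
        count += 1
        pos = genome.find(target, pos + 1)
    return count


if __name__ == "__main__":
    target = "ACGT" * 5                      # toy 20 nt target
    seq = "AA" + target + "TGG" + "CC"       # target followed by an NGG PAM
    for start, site, pam in find_ngg_targets(seq):
        # A site occurring once in the searched sequence is "specific" here.
        print(start, site, pam, count_occurrences(seq, site))
```

A realistic specificity check would also search the reverse complement and tolerate mismatches; both are left out here to keep the sketch short.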

Yamamoto Y.,Database Center for Life Science | Kawamoto S.,Database Center for Life Science
CEUR Workshop Proceedings | Year: 2012

There is a growing need for efficient and integrated access to databases provided by diverse institutions. Using a linked data design pattern allows the diverse data on the Internet to be linked effectively and accessed efficiently by computers. In addition, providing a dictionary in Resource Description Framework (RDF) to translate words into another language helps to cross a language barrier, such as that between English and Japanese, when we want to access datasets in multiple languages. Here, we built a Linked Open Dataset of the Life Science Dictionary (LSD) with links to DBpedia. LSD consists of various lexical resources including English-Japanese / Japanese-English dictionaries and a thesaurus using the MeSH vocabulary. The latest version of LSD contains 110 thousand English and 120 thousand Japanese terms. Since we believe that LSD is a useful language resource in the life science domain for processing Japanese and English text data seamlessly, linking LSD to DBpedia enables us to find related knowledge more easily and therefore contributes to the life science research community.

Kim J.D.,Database Center for Life Science
BMC bioinformatics | Year: 2012

The Genia task, when it was introduced in 2009, was the first community-wide effort to address fine-grained, structural information extraction from biomedical literature. Arranged for the second time as one of the main tasks of BioNLP Shared Task 2011, it aimed to measure the progress of the community since 2009, and to evaluate generalization of the technology to full-text papers. The Protein Coreference task was arranged as one of the supporting tasks, motivated by one of the lessons of the 2009 task: that the abundance of coreference structures in natural language text hinders further improvement on the Genia task. The Genia task received final submissions from 15 teams. The results show that the community has made significant progress, with a best F-score of 74% in extracting bio-molecular events of simple structure, e.g., gene expressions, and 45% to 48% in extracting those of complex structure, e.g., regulations. The Protein Coreference task received 6 final submissions. The results show that coreference resolution performance in the biomedical domain lags behind that in the newswire domain (cf. 50% vs. 66% in MUC score). In particular, for protein coreference resolution the best system achieved an F-score of 34%. Detailed analysis of the results improves our insight into the problem and suggests directions for further improvement.

Nguyen N.,Chiyoda Corporation | Kim J.-D.,Database Center for Life Science | Miwa M.,University of Manchester | Matsuzaki T.,Chiyoda Corporation | Tsujii J.,Microsoft
BMC Bioinformatics | Year: 2012

Background: Current research has shown that major difficulties in event extraction for the biomedical domain are traceable to coreference. Therefore, coreference resolution is believed to be useful for improving event extraction. To address coreference resolution in molecular biology literature, the Protein Coreference (COREF) task was arranged in the BioNLP Shared Task (BioNLP-ST, hereafter) 2011, as a supporting task. However, the shared task results indicated that transferring coreference resolution methods developed for other domains to the biological domain was not a straightforward task, due to the domain differences in the coreference phenomena. Results: We analyzed the contribution of domain-specific information, including the information that indicates the protein type, in a rule-based protein coreference resolution system. In particular, the domain-specific information is encoded into semantic classification modules whose output is used in different components of the coreference resolution. We compared our system with the top four systems in the BioNLP-ST 2011; surprisingly, we found that the minimal configuration outperformed the best system in the BioNLP-ST 2011. Analysis of the experimental results revealed that semantic classification using protein information contributed to an increase in F-score of 2.3% on the test data and 4.0% on the development data. Conclusions: The use of domain-specific information in semantic classification is important for effective coreference resolution. Since it is difficult to transfer domain-specific information across different domains, we need to continue to seek methods to utilize such information in coreference resolution. © 2012 Nguyen et al.; licensee BioMed Central Ltd.

Iwasaki W.,University of Tokyo | Yamamoto Y.,Database Center for Life Science | Takagi T.,University of Tokyo | Takagi T.,Database Center for Life Science | Takagi T.,National Institute of Genetics
PLoS ONE | Year: 2010

In this paper, we describe a server/client literature management system specialized for the life science domain, the TogoDoc system (Togo, pronounced Toe-Go, is a romanization of a Japanese word for integration). The server and the client program cooperate closely over the Internet to provide life scientists with an effective literature recommendation service and efficient literature management. The content-based and personalized literature recommendation helps researchers to isolate interesting papers from the "tsunami" of literature, in which, on average, more than one biomedical paper is added to MEDLINE every minute. Because researchers these days need to follow updates on a much wider range of topics to generate hypotheses using massive datasets obtained from public databases or omics experiments, the importance of having an effective literature recommendation service is rising. The automatic recommendation is based on the content of personal literature libraries of electronic PDF papers. The client program automatically analyzes these files, which are sometimes deeply buried in the storage disks of researchers' personal computers. Simply saving PDF papers to the designated folders causes the client program to automatically analyze them and retrieve metadata, rename the files, synchronize the data to the server, and receive the recommendation lists of newly published papers, thus accomplishing effortless literature management. In addition, tag suggestion and associative search functions are provided for easy classification of and access to past papers (researchers who read many papers sometimes only vaguely remember or completely forget what they read in the past). The TogoDoc system is available for both Windows and Mac OS X and is free. Both the TogoDoc Client software and the TogoDoc server are available online. © 2010 Iwasaki et al.

Term clustering, by measuring the string similarities between terms, is known within the natural language processing community to be an effective method for improving the quality of texts and dictionaries. However, we have observed that chemical names are difficult to cluster using string similarity measures. In order to clearly demonstrate this difficulty, we compared the string similarities determined using the edit distance, the Monge-Elkan score, SoftTFIDF, and the bigram Dice coefficient for chemical names with those for non-chemical names. Our experimental results revealed the following: (1) The edit distance had the best performance in the matching of full forms, whereas Cohen et al. reported that SoftTFIDF with the Jaro-Winkler distance would yield the best measure for matching pairs of terms for their experiments. (2) For each of the string similarity measures above, the best threshold for term matching differs for chemical names and for non-chemical names; the difference is especially large for the edit distance. (3) Although the matching results obtained for chemical names using the edit distance, Monge-Elkan scores, or the bigram Dice coefficients are better than the result obtained for non-chemical names, the results were contrary when using SoftTFIDF. (4) A suitable weight for chemical names varies substantially from one for non-chemical names. In particular, a weight vector that has been optimized for non-chemical names is not suitable for chemical names. (5) The matching results using the edit distances improve further by dividing a set of full forms into two subsets, according to whether a full form is a chemical name or not. These results show that our hypothesis is acceptable, and that we can significantly improve the performance of abbreviation-full form clustering by computing chemical names and non-chemical names separately. 
In conclusion, the discriminative application of string similarity methods to chemical and non-chemical names may be a simple yet effective way to improve the performance of term clustering.
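
Two of the string similarity measures compared above, the edit distance and the bigram Dice coefficient, can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, the Dice coefficient is computed over character-bigram multisets, and the length normalization in `edit_similarity` is an assumed convention, since the abstract does not say how the scores were scaled or thresholded.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance scaled to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def bigram_dice(a, b):
    """Dice coefficient over character-bigram multisets."""
    ba = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Neither string has a bigram (length < 2): fall back to equality.
        return 1.0 if a == b else 0.0
    shared = sum((ba & bb).values())  # multiset intersection size
    return 2.0 * shared / total


if __name__ == "__main__":
    # Illustrative pair of full forms; a real experiment would sweep a
    # matching threshold separately for chemical and non-chemical names.
    x, y = "acetylcholine", "acetylcholinesterase"
    print(edit_distance(x, y), edit_similarity(x, y), bigram_dice(x, y))
```

Following finding (5) above, one way to use such functions would be to tune a separate matching threshold for each subset of full forms, chemical and non-chemical, instead of applying a single threshold to all terms.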

Yamamoto Y.,Database Center for Life Science | Yamaguchi A.,Database Center for Life Science | Bono H.,Database Center for Life Science | Takagi T.,University of Tokyo
Database | Year: 2011

Many abbreviations are used in the literature, especially in the life sciences, and polysemous abbreviations appear frequently, making it difficult to read and understand scientific papers that are outside of a reader's expertise. Thus, we have developed Allie, a database and a search service of abbreviations and their long forms (a.k.a. full forms or definitions). Allie searches for abbreviations and their corresponding long forms in a database that we have generated based on all titles and abstracts in MEDLINE. When a user query matches an abbreviation, Allie returns all potential long forms of the query along with their bibliographic data (i.e. title and publication year). In addition, for each candidate, co-occurring abbreviations and a research field in which it frequently appears in the MEDLINE data are displayed. This function helps users learn about the context in which an abbreviation appears. To deal with synonymous long forms, we use a dictionary called GENA that contains domain-specific terms such as gene, protein or disease names along with their synonymic information. Conceptually identical domain-specific terms are regarded as one term, and then conceptually identical abbreviation-long form pairs are grouped taking into account their appearance in MEDLINE. To keep up with new abbreviations that are continuously introduced, Allie has an automatic update system. In addition, the database of abbreviations and their long forms with their corresponding PubMed IDs is constructed and updated weekly. © The Author(s) 2011.
