National Center for Text Mining

Manchester, United Kingdom


Kemper B.,University of Tokyo | Matsuzaki T.,University of Tokyo | Matsuoka Y,The Systems Biology Institute | Tsuruoka Y.,National Center for Text Mining | And 8 more authors.
Bioinformatics | Year: 2010

Motivation: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time collecting relevant articles and identifying the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by the Systems Biology Institute, the Okinawa Institute of Science and Technology, the National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations. © The Author(s) 2010. Published by Oxford University Press.

Sutcliffe A.,University of Manchester | Thew S.,University of Manchester | De Bruijn O.,University of Manchester | Buchan I.,Northwest Institute for Bio Health Informatics NIHBI | And 3 more authors.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences | Year: 2010

This paper describes the application of user-centred design (UCD) methods and a user engagement (UE) approach to a case study development of a visualization tool (ADVISES) to support epidemiological research. The combined UCD/UE approach consisted of scenario-based design, and analysis of the users' tasks and mental model of the domain. Prototyping and storyboarding techniques were used to explore design options with users as well as specifying functionality for two versions of the software to meet the needs of novice and expert users. An evaluation of the prototype was carried out to assess the extent to which the expert model would support public health professionals in their analysis activities. The results of the design exploration requirements analysis study are reported. The implications of scenario-based design exploration, participatory design and user engagement are discussed. © 2010 The Royal Society.

Miwa M.,University of Tokyo | Saetre R.,University of Tokyo | Kim J.-D.,University of Tokyo | Tsujii J.,University of Tokyo | And 2 more authors.
Journal of Bioinformatics and Computational Biology | Year: 2010

Biomedical Natural Language Processing (BioNLP) attempts to capture biomedical phenomena from texts by extracting relations between biomedical entities (i.e. proteins and genes). Traditionally, only binary relations have been extracted from large numbers of published papers. Recently, more complex relations (biomolecular events) have also been extracted. Such events may include several entities or other relations. To evaluate the performance of the text mining systems, several shared task challenges have been arranged for the BioNLP community. With a common and consistent task setting, the BioNLP'09 shared task evaluated complex biomolecular events such as binding and regulation. Finding these events automatically is important in order to improve biomedical event extraction systems. In the present paper, we propose an automatic event extraction system, which contains a model for complex events, by solving a classification problem with rich features. The main contributions of the present paper are: (1) the proposal of an effective bio-event detection method using machine learning, (2) provision of a high-performance event extraction system, and (3) the execution of a quantitative error analysis. The proposed complex (binding and regulation) event detector outperforms the best system from the BioNLP'09 shared task challenge. © 2010 The Authors.

Wu X.,University of Tokyo | Matsuzaki T.,University of Tokyo | Tsujii J.,University of Tokyo | Tsujii J.,University of Manchester | Tsujii J.,National Center for Text Mining
Machine Translation | Year: 2010

This paper introduces deep syntactic structures to syntax-based Statistical Machine Translation (SMT). We use a Head-driven Phrase Structure Grammar (HPSG) parser to obtain the deep syntactic structures of a sentence, which include not only a fine-grained syntactic property description but also a semantic representation. Considering the abundant information included in the deep syntactic structures, it is interesting to investigate whether or not they improve the traditional syntax-based translation models based on PCFG parsers. In order to use deep syntactic structures for SMT, this paper focuses on extracting tree-to-string translation rules from aligned HPSG tree-string pairs. The major challenge is to properly localize the non-local relations among nodes in an HPSG tree. To localize the semantic dependencies among words and phrases, which can be inherently non-local, a minimum covering tree is defined by taking a predicate word and its lexical/phrasal arguments as the frontier nodes. Starting from this definition, a linear-time algorithm is proposed to extract translation rules through one-time traversal of the leaf nodes in an HPSG tree. Extensive experiments on a tree-to-string translation system testified the effectiveness of our proposal. © 2010 Springer Science+Business Media B.V.

Miwa M.,University of Tokyo | Saetre R.,University of Tokyo | Miyao Y.,National Institute of Informatics | Tsujii J.,University of Tokyo | And 2 more authors.
Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference | Year: 2010

Relations between entities in text have been widely researched in the natural language processing and information extraction communities. The region connecting a pair of entities (in a parsed sentence) is often used to construct kernels or feature vectors that can recognize and extract interesting relations. Such regions are useful, but they can also incorporate unnecessary distracting information. In this paper, we propose a rule based method to remove the information that is unnecessary for relation extraction. Protein-protein interaction (PPI) is used as an example relation extraction problem. A dozen simple rules are defined on output from a deep parser. Each rule specifically examines the entities in one target interaction pair. These simple rules were tested using several PPI corpora. The PPI extraction performance was improved on all the PPI corpora.

Thew S.L.,University of Manchester | Sutcliffe A.,University of Manchester | De Bruijn O.,University of Manchester | McNaught J.,National Center for Text Mining | And 3 more authors.
Methods of Information in Medicine | Year: 2011

Background and Objectives: We present a prototype visualisation tool, ADVISES (Adaptive Visualization for e-Science), designed to support epidemiologists and public health practitioners in exploring geo-coded datasets and generating spatial epidemiological hypotheses. The tool is designed to support creative thinking while providing the means for the user to evaluate the validity of the visualization in terms of statistical uncertainty. We present an overview of the application and the results of an evaluation exploring public health researchers' responses to maps as a new way of viewing familiar data, in particular the use of thematic maps with adjoining descriptive statistics and forest plots to support the generation and evaluation of new hypotheses. Methods: A series of qualitative evaluations involved one experienced researcher asking 21 volunteers to interact with the system to perform a series of relatively complex, realistic map-building and exploration tasks, using a 'think aloud' protocol, followed by a semi-structured interview. The volunteers were academic epidemiologists and UK National Health Service analysts. Results: All users quickly and confidently created maps, and went on to spend substantial amounts of time exploring and interacting with the system, generating hypotheses about their maps. Conclusions: Our findings suggest that the tool is able to support creativity and statistical appreciation among public health professionals and epidemiologists building thematic maps. Software such as this, introduced appropriately, could increase the capability of existing personnel for generating public health intelligence. © Schattauer 2011.

Miwa M.,University of Tokyo | Pyysalo S.,University of Tokyo | Hara T.,University of Tokyo | Tsujii J.,University of Tokyo | And 2 more authors.
Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference | Year: 2010

The detailed analyses of sentence structure provided by parsers have been applied to address several information extraction tasks. In a recent bio-molecular event extraction task, state-of-the-art performance was achieved by systems building specifically on dependency representations of parser output. While intrinsic evaluations have shown significant advances in both general and domain-specific parsing, the question of how these translate into practical advantage is seldom considered. In this paper, we analyze how event extraction performance is affected by parser and dependency representation, further considering the relation between intrinsic evaluation and performance at the extraction task. We find that good intrinsic evaluation results do not always imply good extraction performance, and that the types and structures of different dependency representations have specific advantages and disadvantages for the event extraction task.

Ananiadou S.,University of Manchester | Ananiadou S.,National Center for Text Mining | Thompson P.,University of Manchester | Nawaz R.,University of Manchester | And 2 more authors.
Briefings in Functional Genomics | Year: 2015

The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of 'events', i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research. © The Author 2014. Published by Oxford University Press. All rights reserved.

Kano Y.,University of Tokyo | Dobson P.,University of Manchester | Nakanishi M.,University of Tokyo | Tsujii J.,University of Tokyo | And 4 more authors.
Bioinformatics | Year: 2010

Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare to Taverna, a generic workflow system, to expose text mining functionality to the bioinformatics community. © The Author(s) 2010. Published by Oxford University Press.

Andrade D.,University of Tokyo | Matsuzaki T.,University of Tokyo | Tsujii J.,University of Manchester | Tsujii J.,National Center for Text Mining
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2011

Existing dictionaries may be effectively enlarged by finding the translations of single words, using comparable corpora. The idea is based on the assumption that similar words have similar contexts across multiple languages. However, previous research suggests the use of a simple bag-of-words model to capture the lexical context, or assumes that sufficient context information can be captured by the successor and predecessor of the dependency tree. While the latter may be sufficient for a close language-pair, we observed that the method is insufficient if the languages differ significantly, as is the case for Japanese and English. Given a query word, our proposed method uses a statistical model to extract relevant words, which tend to co-occur in the same sentence; additionally our proposed method uses three statistical models to extract relevant predecessors, successors and siblings in the dependency tree. We then combine the information gained from the four statistical models, and compare this lexical-dependency information across English and Japanese to identify likely translation candidates. Experiments based on openly accessible comparable corpora verify that our proposed method yields a statistically significant increase in Top 1 accuracy of around 13 percentage points, to 53%, and raises Top 20 accuracy to 91%. © 2011 Springer-Verlag.
