Time filter

Source Type

Seoul, South Korea

Kim S.,Sogang University | Yoon J.,Daumsoft | Yang J.,Sogang University | Park S.,Sogang University
BMC Bioinformatics | Year: 2010

Background: The construction of interaction networks between proteins is central to understanding the underlying biological processes. However, since many useful relations are excluded in databases and remain hidden in raw text, a study on automatic interaction extraction from text is important in bioinformatics field.Results: Here, we suggest two kinds of kernel methods for genic interaction extraction, considering the structural aspects of sentences. First, we improve our prior dependency kernel by modifying the kernel function so that it can involve various substructures in terms of (1) e-walks, (2) partial match, (3) non-contiguous paths, and (4) different significance of substructures. Second, we propose the walk-weighted subsequence kernel to parameterize non-contiguous syntactic structures as well as semantic roles and lexical features, which makes learning structural aspects from a small amount of training data effective. Furthermore, we distinguish the significances of parameters such as syntactic locality, semantic roles, and lexical features by varying their weights.Conclusions: We addressed the genic interaction problem with various dependency kernels and suggested various structural kernel scenarios based on the directed shortest dependency path connecting two entities. Consequently, we obtained promising results over genic interaction data sets with the walk-weighted subsequence kernel. The results are compared using automatically parsed third party protein-protein interaction (PPI) data as well as perfectly syntactic labeled PPI data. © 2010 Kim et al; licensee BioMed Central Ltd.

Kim S.,Sogang University | Yoon J.,Daumsoft
Journal of Biomedical Informatics | Year: 2015

Introduction: The ambiguity of biomedical abbreviations is one of the challenges in biomedical text mining systems. In particular, the handling of term variants and abbreviations without nearby definitions is a critical issue. In this study, we adopt the concepts of topic of document and word link to disambiguate biomedical abbreviations. Methods: We newly suggest the link topic model inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics, where each topic is characterized by a distribution over words. Thus, the most probable expansions with respect to abbreviations of a given abstract are determined by word-topic, document-topic, and word-link distributions estimated from a document collection through the link topic model. The model allows two distinct modes of word generation to incorporate semantic dependencies among words, particularly long form words of abbreviations and their sentential co-occurring words; a word can be generated either dependently on the long form of the abbreviation or independently. The semantic dependency between two words is defined as a link and a new random parameter for the link is assigned to each word as well as a topic parameter. Because the link status indicates whether the word constitutes a link with a given specific long form, it has the effect of determining whether a word forms a unigram or a skipping/consecutive bigram with respect to the long form. Furthermore, we place a constraint on the model so that a word has the same topic as a specific long form if it is generated in reference to the long form. Consequently, documents are generated from the two hidden parameters, i.e. topic and link, and the most probable expansion of a specific abbreviation is estimated from the parameters. Results: Our model relaxes the bag-of-words assumption of the standard topic model in which the word order is neglected, and it captures a richer structure of text than does the standard topic model by considering unigrams and semantically associated bigrams simultaneously. The addition of semantic links improves the disambiguation accuracy without removing irrelevant contextual words and reduces the parameter space of massive skipping or consecutive bigrams. The link topic model achieves 98.42% disambiguation accuracy on 73,505 MEDLINE abstracts with respect to 21 three letter abbreviations and their 139 distinct long forms. © 2014 Elsevier Inc.

Kim S.,Sogang University | Yoon J.,Daumsoft | Seo J.,Sogang University | Park S.,Sogang University
Pattern Recognition Letters | Year: 2012

This paper deals with verb-verb morphological disambiguation of two different verbs that have the same inflected form. The verb-verb morphological ambiguity (VVMA) is one of the critical Korean parts of speech (POS) tagging issues. The recognition of verb base forms related to ambiguous words highly depends on the lexical information in their surrounding contexts and the domains they occur in. However, current probabilistic morpheme-based POS tagging systems cannot handle VVMA adequately since most of them have a limitation to reflect a broad context of word level, and they are trained on too small amount of labeled training data to represent sufficient lexical information required for VVMA disambiguation. In this study, we suggest a classifier based on a large pool of raw text that contains sufficient lexical information to handle the VVMA. The underlying idea is that we automatically generate the annotated training set applicable to the ambiguity problem such as VVMA resolution via unlabeled unambiguous instances which belong to the same class. This enables to label ambiguous instances with the knowledge that can be induced from unambiguous instances. Since the unambiguous instances have only one label, the automatic generation of their annotated corpus are possible with unlabeled data. In our problem, since all conjugations of irregular verbs do not lead to the spelling changes that cause the VVMA, a training data for the VVMA disambiguation are generated via the instances of unambiguous conjugations related to each possible verb base form of ambiguous words. This approach does not require an additional annotation process for an initial training data set or a selection process for good seeds to iteratively augment a labeling set which are important issues in bootstrapping methods using unlabeled data. Thus, this can be strength against previous related works using unlabeled data. Furthermore, a plenty of confident seeds that are unambiguous and can show enough coverage for learning process are assured as well. We also suggest a strategy to extend the context information incrementally with web counts only to selected test examples that are difficult to predict using the current classifier or that are highly different from the pre-trained data set. As a result, automatic data generation and knowledge acquisition from unlabeled text for the VVMA resolution improved the overall tagging accuracy (token-level) by 0.04%. In practice, 9-10% out of verb-related tagging errors are fixed by the VVMA resolution whose accuracy was about 98% by using the Naïve Bayes classifier coupled with selective web counts. © 2011 Elsevier B.V. All rights reserved.

Daumsoft | Entity website

WHAT WE DO . , , ...

Discover hidden collaborations