Seoul, South Korea

Kim S.,Sogang University | Yoon J.,Daumsoft
Journal of Biomedical Informatics | Year: 2015

Introduction: The ambiguity of biomedical abbreviations is one of the challenges in biomedical text mining systems. In particular, the handling of term variants and abbreviations without nearby definitions is a critical issue. In this study, we adopt the concepts of document topic and word link to disambiguate biomedical abbreviations. Methods: We propose a link topic model inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics and each topic is characterized by a distribution over words. The most probable expansions of the abbreviations in a given abstract are determined by the word-topic, document-topic, and word-link distributions that the link topic model estimates from a document collection. The model allows two distinct modes of word generation to incorporate semantic dependencies among words, particularly between the long-form words of abbreviations and the words that co-occur with them in a sentence: a word can be generated either dependently on the long form of the abbreviation or independently. The semantic dependency between two words is defined as a link, and each word is assigned a new random parameter for the link in addition to a topic parameter. Because the link status indicates whether the word constitutes a link with a given long form, it determines whether the word forms a unigram or a skipping/consecutive bigram with respect to the long form. Furthermore, we constrain the model so that a word has the same topic as a specific long form if it is generated in reference to that long form. Consequently, documents are generated from the two hidden parameters, i.e., topic and link, and the most probable expansion of a specific abbreviation is estimated from these parameters.
Results: Our model relaxes the bag-of-words assumption of the standard topic model, in which word order is neglected, and it captures a richer structure of text than the standard topic model does by considering unigrams and semantically associated bigrams simultaneously. The addition of semantic links improves the disambiguation accuracy without removing irrelevant contextual words and reduces the parameter space of massive skipping or consecutive bigrams. The link topic model achieves 98.42% disambiguation accuracy on 73,505 MEDLINE abstracts with respect to 21 three-letter abbreviations and their 139 distinct long forms. © 2014 Elsevier Inc.
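
The idea of choosing an expansion from document-topic and word-topic distributions can be illustrated with a minimal sketch. This is not the authors' link topic model: it omits the link variable entirely and simply scores each candidate long form by summing P(topic | document) × P(long form | topic) over topics. The topic names, candidate long forms, and probabilities are hypothetical values chosen for illustration.

```python
# Hypothetical toy distributions; in practice they would be estimated
# from a document collection by a topic model.
DOC_TOPIC = {"cardiology": 0.8, "oncology": 0.2}  # P(topic | document)

LONGFORM_TOPIC = {  # P(long form | topic), illustrative values only
    "atrial premature complex": {"cardiology": 0.6, "oncology": 0.1},
    "adenomatous polyposis coli": {"cardiology": 0.05, "oncology": 0.7},
}


def score_expansion(long_form, doc_topic, longform_topic):
    """Sum over topics of P(topic | document) * P(long form | topic)."""
    return sum(
        p_topic * longform_topic[long_form].get(topic, 0.0)
        for topic, p_topic in doc_topic.items()
    )


def best_expansion(candidates, doc_topic, longform_topic):
    """Return the candidate long form with the highest score."""
    return max(
        candidates,
        key=lambda lf: score_expansion(lf, doc_topic, longform_topic),
    )


if __name__ == "__main__":
    print(best_expansion(list(LONGFORM_TOPIC), DOC_TOPIC, LONGFORM_TOPIC))
```

With these toy numbers, a document that is mostly about cardiology favors the cardiology-related expansion of the abbreviation.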


Kim S.,Sogang University | Yoon J.,Daumsoft | Yang J.,Sogang University | Park S.,Sogang University
BMC Bioinformatics | Year: 2010

Background: The construction of interaction networks between proteins is central to understanding the underlying biological processes. However, since many useful relations are excluded from databases and remain hidden in raw text, the study of automatic interaction extraction from text is important in the bioinformatics field. Results: Here, we suggest two kinds of kernel methods for genic interaction extraction that consider the structural aspects of sentences. First, we improve our prior dependency kernel by modifying the kernel function so that it can involve various substructures in terms of (1) e-walks, (2) partial match, (3) non-contiguous paths, and (4) different significance of substructures. Second, we propose the walk-weighted subsequence kernel to parameterize non-contiguous syntactic structures as well as semantic roles and lexical features, which makes learning structural aspects from a small amount of training data effective. Furthermore, we distinguish the significance of parameters such as syntactic locality, semantic roles, and lexical features by varying their weights. Conclusions: We addressed the genic interaction problem with various dependency kernels and suggested various structural kernel scenarios based on the directed shortest dependency path connecting two entities. Consequently, we obtained promising results on genic interaction data sets with the walk-weighted subsequence kernel. The results are compared using automatically parsed third-party protein-protein interaction (PPI) data as well as PPI data with perfect syntactic labels. © 2010 Kim et al; licensee BioMed Central Ltd.


Kim S.,Sogang University | Yoon J.,Daumsoft | Seo J.,Sogang University | Park S.,Sogang University
Pattern Recognition Letters | Year: 2012

This paper deals with verb-verb morphological disambiguation of two different verbs that have the same inflected form. Verb-verb morphological ambiguity (VVMA) is one of the critical issues in Korean part-of-speech (POS) tagging. The recognition of the verb base forms of ambiguous words depends heavily on the lexical information in their surrounding contexts and on the domains in which they occur. However, current probabilistic morpheme-based POS tagging systems cannot handle VVMA adequately, since most of them are limited in their ability to reflect broad word-level context and are trained on too little labeled data to represent the lexical information required for VVMA disambiguation. In this study, we suggest a classifier based on a large pool of raw text that contains sufficient lexical information to handle VVMA. The underlying idea is to automatically generate an annotated training set for an ambiguity problem such as VVMA resolution from unlabeled unambiguous instances that belong to the same class. This makes it possible to label ambiguous instances with knowledge induced from unambiguous instances. Since the unambiguous instances have only one label, their annotated corpus can be generated automatically from unlabeled data. In our problem, since not all conjugations of irregular verbs lead to the spelling changes that cause VVMA, training data for VVMA disambiguation are generated from instances of the unambiguous conjugations of each possible verb base form of an ambiguous word. This approach requires neither an additional annotation process for an initial training set nor a selection process for good seeds to iteratively augment the labeled set, both of which are important issues in bootstrapping methods that use unlabeled data. This is a strength relative to previous related work using unlabeled data.
Furthermore, plenty of confident seeds that are unambiguous and provide enough coverage for the learning process are assured as well. We also suggest a strategy to extend the context information incrementally with web counts only for selected test examples that are difficult to predict using the current classifier or that differ greatly from the pre-trained data set. As a result, automatic data generation and knowledge acquisition from unlabeled text for VVMA resolution improved the overall token-level tagging accuracy by 0.04%. In practice, 9-10% of verb-related tagging errors are fixed by the VVMA resolution, whose accuracy was about 98% using the Naïve Bayes classifier coupled with selective web counts. © 2011 Elsevier B.V. All rights reserved.
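
The core idea, training on automatically labeled unambiguous instances and then classifying ambiguous ones, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' system: it uses an English analogy (the surface form "found" is either the past tense of "find" or the base form of "found", while "finds", "finding", "founded", and "founding" are unambiguous), a four-sentence hypothetical corpus, and a plain multinomial Naive Bayes with add-one smoothing. The selective web-count step is not modeled.

```python
import math
from collections import Counter, defaultdict

# Unambiguous inflected forms serve as free labels (hypothetical English analogy).
UNAMBIGUOUS = {
    "finds": "find", "finding": "find",
    "founded": "found", "founding": "found",
}

# Hypothetical raw, unlabeled corpus.
RAW_CORPUS = [
    "she finds the lost keys",
    "he keeps finding old coins",
    "they founded the company in Seoul",
    "founding a company takes money",
]


def auto_label(sentences):
    """Yield (context_words, base_form) for each unambiguous verb form."""
    for sentence in sentences:
        tokens = sentence.lower().split()
        for token in tokens:
            if token in UNAMBIGUOUS:
                context = [t for t in tokens if t != token]
                yield context, UNAMBIGUOUS[token]


def train(examples):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, label in examples:
        class_counts[label] += 1
        word_counts[label].update(context)
        vocab.update(context)
    return class_counts, word_counts, vocab


def classify(context, model):
    """Return the base form with the highest smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, count in class_counts.items():
        logp = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in context:
            if word in vocab:  # ignore words never seen in training
                logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


MODEL = train(auto_label(RAW_CORPUS))

if __name__ == "__main__":
    # Context of the ambiguous token in "they found the company".
    print(classify(["they", "the", "company"], MODEL))
```

No manual annotation enters the pipeline: the labels come entirely from the unambiguous surface forms, which is the property the abstract highlights.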


Song G.-Y.,Korea University | Cheon Y.,Yonsei University | Lee K.,Daumsoft | Lim H.,Korea University | And 2 more authors.
Personal and Ubiquitous Computing | Year: 2014

As various forms of social media spread, we often witness an idea from an individual user driving macroscopic changes. From the perspectives of product development and marketing, the opinions left by potential consumers in online social networks can generate large ripple effects. This study analyzes user opinions in online space to grasp preferences toward various products as psychologically categorized by users. We also suggest an aspect of the market as mentally configured by users, using network modeling within the framework of economic sociology. Existing analyses of online marketplaces mainly deal with structural issues such as inter-actor relationships and status measurement. This study, however, analyzes complex preferences regarding diverse products and brands and derives a new model for inter-market connections. We expect that our study will have important implications for digital marketing and for the community design of corporations planning word-of-mouth effects in online space. © 2013 Springer-Verlag London.


Won H.-H.,Samsung | Myung W.,Sungkyunkwan University | Song G.-Y.,Daumsoft | Lee W.-H.,Daumsoft | And 3 more authors.
PLoS ONE | Year: 2013

Suicide is not only an individual phenomenon, but it is also influenced by social and environmental factors. With the high suicide rate and the abundance of social media data in South Korea, we have studied the potential of this new medium for predicting completed suicide at the population level. We tested two social media variables (suicide-related and dysphoria-related weblog entries) along with classical social, economic and meteorological variables as predictors of suicide over 3 years (2008 through 2010). Both social media variables were powerfully associated with suicide frequency. The suicide variable displayed high variability and was reactive to celebrity suicide events, while the dysphoria variable showed longer secular trends, with lower variability. We interpret these as reflections of social affect and social mood, respectively. In the final multivariate model, the two social media variables, especially the dysphoria variable, displaced two classical economic predictors - consumer price index and unemployment rate. The prediction model developed with the 2-year training data set (2008 through 2009) was validated in the data for 2010 and was robust in a sensitivity analysis controlling for celebrity suicide effects. These results indicate that social media data may be of value in national suicide forecasting and prevention. © 2013 Won et al.


Woo H.,Seoul National University | Cho Y.,Seoul National University | Shim E.,Seoul National University | Lee K.,Daumsoft | Song G.,Daumsoft
International Journal of Environmental Research and Public Health | Year: 2015

The Sewol ferry disaster severely shocked Korean society. The objective of this study was to explore how the public mood in Korea changed following the Sewol disaster using Twitter data. Data were collected from daily Twitter posts from 1 January 2011 to 31 December 2013 and from 1 March 2014 to 30 June 2014 using natural language-processing and text-mining technologies. We investigated the emotional utterances in reaction to the disaster by analyzing the appearance of keywords, including keywords related to human-made disasters and keywords related to suicide. This disaster elicited immediate emotional reactions from the public, including anger directed at various social and political events occurring in the aftermath of the disaster. We also found that although the frequency of Twitter keywords fluctuated greatly during the month after the Sewol disaster, keywords associated with suicide were common in the general population. Policy makers should recognize that both those directly affected and the general public still suffer from the effects of this traumatic event and its aftermath. The mood changes experienced by the general population should be monitored after a disaster, and social media data can be useful for this purpose. © 2015 by the authors; licensee MDPI, Basel, Switzerland.


Song G.-Y.,Korea University | Cheon Y.,Yonsei University | Lee K.,Daumsoft | Park K.M.,Yonsei University | Rim H.-C.,Korea University
KSII Transactions on Internet and Information Systems | Year: 2014

Social media is considered a valuable platform for gathering and analyzing the collective and subconscious opinions of people in Internet and mobile environments, where they express, explicitly and implicitly, their daily preferences for brands and products. Extracting and tracking the various attitudes and concerns that people express through social media could enable us to categorize brands and decipher individuals' cognitive decision-making structure in their choice of brands. We investigate the cognitive network structure of consumers by building an inter-category map through the mining of big data. In so doing, we create an improved online recommendation model. Building on economic sociology theory, we suggest a framework for revealing collective preference by analyzing the patterns of brand names that users frequently mention in the online public sphere. We expect that our study will be useful for those conducting theoretical research on digital marketing strategies and doing practical work on branding strategies. © 2014 KSII.


Park S.-Y.,Sangmyung University | Byun J.,Daumsoft | Rim H.-C.,Korea University | Lee D.-G.,Korea University | Lim H.,Korea University
IEEE Transactions on Consumer Electronics | Year: 2010

In this paper, we propose a natural language-based interface model that enables a user to articulate a request without any specific knowledge about a mobile device. In consideration of the very limited computing and memory capacity of the mobile device, and to keep the development cost low, the proposed model does not depend on typical natural language techniques but on a ranking technique that is simplified through a mathematical derivation under the following assumptions. One assumption is that a device control command consists of a function and its parameters. The other assumption is that a parameter is represented by a few predictable patterns, whereas a function can be represented by various sentence patterns. To deal with these various sentence patterns, the proposed model generates all possible candidates with their scores and selects the top-ranked command candidate with the highest score. Furthermore, the ranking score function is designed to achieve a high discriminative capability by simulating the process of generating every candidate. Experimental results show that the proposed model, at 2.9 megabytes, performs at 96.27% accuracy, which is slightly lower than the 97.06% of the baseline model at 135.2 megabytes. © 2006 IEEE.
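
The generate-and-rank scheme described above can be sketched in a few lines. This is a hedged illustration, not the paper's model: the command inventory, keyword sets, and regular expressions are hypothetical, and a simple keyword-overlap score stands in for the authors' derived ranking function. It keeps the two stated assumptions: a command is a function plus parameters, and parameters follow a few predictable patterns.

```python
import re

# Hypothetical command inventory: each function has trigger keywords.
FUNCTIONS = {
    "set_alarm": {"alarm", "wake", "remind"},
    "call_contact": {"call", "dial", "phone"},
}

# Hypothetical parameter patterns: parameters are assumed to be predictable.
PARAM_PATTERNS = {
    "set_alarm": re.compile(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b"),
    "call_contact": re.compile(r"\b(?:call|dial|phone)\s+(\w+)"),
}


def rank_commands(utterance):
    """Score every candidate command and return them best first."""
    text = utterance.lower()
    words = set(re.findall(r"\w+", text))
    candidates = []
    for name, keywords in FUNCTIONS.items():
        match = PARAM_PATTERNS[name].search(text)
        param = match.group(1) if match else None
        # Stand-in score: keyword overlap plus a bonus for a matched parameter.
        score = len(words & keywords) + (1 if param else 0)
        candidates.append((score, name, param))
    return sorted(candidates, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    print(rank_commands("please wake me up at 7 am")[0])
```

Every candidate is scored, including poor ones, and the top-ranked candidate is taken as the interpreted command; a lookup table of this kind stays small, which matches the memory constraint the abstract emphasizes.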

