Han B.,NICTA Victoria Research Laboratory |
Han B.,University of Melbourne |
Cook P.,NICTA Victoria Research Laboratory |
Baldwin T.,NICTA Victoria Research Laboratory |
Baldwin T.,University of Melbourne
EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference | Year: 2012
Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing. © 2012 Association for Computational Linguistics.
PubMed | NICTA Victoria Research Laboratory and National Library of Medicine
Type: | Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium | Year: 2014
MeSH indexing of MEDLINE is becoming a more difficult task for the group of highly qualified indexing staff at the US National Library of Medicine, due to the large yearly growth of MEDLINE and the increasing size of MeSH. Since 2002, this task has been assisted by the Medical Text Indexer or MTI program. We extend previous machine learning analysis by adding a more diverse set of MeSH headings targeting examples where MTI has been shown to perform poorly. Machine learning algorithms exceed MTIs performance on MeSH headings that are used very frequently and headings for which the indexing frequency is very low. We find that when we combine the MTI suggestions and the prediction of the learning algorithms, the performance improves compared to any single method for most of the evaluated MeSH headings.