Saint Petersburg, Russia

Gareev R.,Kazan Federal University | Tkachenko M.,Saint Petersburg State University | Solovyev V.,Kazan Federal University | Simanovsky A.,HP Labs Russia | Ivanov V.,National University of Science and Technology "MISIS"
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Current research efforts in Named Entity Recognition (NER) deal mostly with the English language. Even though interest in multi-language Information Extraction is growing, only a few works report results for the Russian language. This paper introduces quality baselines for the Russian NER task. We propose a corpus that was manually annotated with organization and person names. The main purpose of this corpus is to provide a gold standard for evaluation. We implemented and evaluated two approaches to NER: knowledge-based and statistical. The first comprises several components: dictionary matching, pattern matching, and rule-based search for lexical representations of entity names within a document. We assembled a set of linguistic resources and evaluated their impact on performance. For the data-driven approach we used our implementation of a linear-chain CRF with a rich set of features. The performance of both systems is promising (62.17% and 75.05% F1 measure), although they do not employ morphological or syntactic analysis. © 2013 Springer-Verlag.

Sapozhnikov G.,Saint Petersburg State University | Ulanov A.,HP Labs Russia

In this paper we present the PHOCS-2 algorithm, which extracts a "Predicted Hierarchy Of ClassifierS". The extracted hierarchy helps us to enhance the performance of flat classification. Nodes in the hierarchy contain classifiers. Each intermediate node corresponds to a set of classes and each leaf node corresponds to a single class. In PHOCS-2 we make an estimate for each node and achieve a more precise computation of false positives, true positives, and false negatives. Stopping criteria are based on the results of the flat classification. The proposed algorithm is validated on nine datasets. © 2012 by the authors.
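The hierarchy structure described in this abstract (intermediate nodes covering a set of classes, leaf nodes covering a single class, a classifier at each node) can be illustrated with a minimal Python sketch. This is not the PHOCS-2 algorithm itself: the `Node` class, the `route` callables, and the keyword rules are hypothetical stand-ins for trained classifiers, and the tree is written by hand rather than extracted from data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One node of a classifier hierarchy (hypothetical structure)."""
    classes: frozenset
    # Stand-in for a trained classifier: returns the index of the child to visit.
    route: Optional[Callable[[str], int]] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def classify(node: Node, document: str) -> str:
    """Descend from the root until a leaf (a single class) is reached."""
    while not node.is_leaf():
        node = node.children[node.route(document)]
    (label,) = node.classes  # a leaf holds exactly one class
    return label


# Hand-built example tree; keyword rules replace real classifiers.
root = Node(
    classes=frozenset({"sports", "politics", "tech"}),
    route=lambda d: 0 if "goal" in d else 1,
    children=[
        Node(classes=frozenset({"sports"})),
        Node(
            classes=frozenset({"politics", "tech"}),
            route=lambda d: 0 if "vote" in d else 1,
            children=[
                Node(classes=frozenset({"politics"})),
                Node(classes=frozenset({"tech"})),
            ],
        ),
    ],
)
```

Calling `classify(root, "a late goal")` walks the left branch and stops at the single-class leaf, while other documents pass through the intermediate node first.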

Tkachenko M.,Saint Petersburg State University | Simanovsky A.,HP Labs Russia
11th Conference on Natural Language Processing, KONVENS 2012: Empirical Methods in Natural Language Processing - Proceedings of the Conference on Natural Language Processing 2012

We propose a domain adaptation method for supervised named entity recognition (NER). Our NER system uses conditional random fields, and we rank and filter the features of a new, unknown domain based on the means of the weights learned on known domains. We perform experiments on English texts from the OntoNotes version 4 benchmark and observe statistically significantly better performance with a small number of features, as well as faster convergence to the maximum F1-measure than with conventional feature selection (information gain). We also compare against using the weights learned on a mixture of known domains.
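The ranking step described in this abstract can be sketched roughly as follows. The abstract does not give the exact scoring rule, so this sketch makes assumptions: it ranks by the mean of absolute weights, treats a feature missing from a domain as having weight zero, and uses hypothetical feature names and function names throughout.

```python
from collections import defaultdict


def rank_features_by_mean_weight(domain_weights):
    """Rank features by mean absolute weight over known domains.

    domain_weights maps a domain name to a {feature: weight} dict.
    Assumption: a feature absent from a domain contributes weight 0.
    """
    totals = defaultdict(float)
    for weights in domain_weights.values():
        for feature, weight in weights.items():
            totals[feature] += abs(weight)
    n_domains = len(domain_weights)
    means = {f: total / n_domains for f, total in totals.items()}
    # Highest mean first; ties broken alphabetically for determinism.
    return sorted(means, key=lambda f: (-means[f], f))


def select_top_k(domain_weights, k):
    """Keep only the k highest-ranked features for the new domain."""
    return rank_features_by_mean_weight(domain_weights)[:k]


# Hypothetical weights learned on two known domains.
known = {
    "news": {"is_capitalized": 2.0, "suffix=ov": 0.5},
    "web": {"is_capitalized": 1.0, "prev=Mr": 1.5},
}
```

With these toy weights, `select_top_k(known, 2)` keeps the two features whose weights are largest on average across the known domains.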

Ulanov A.,HP Labs Russia | Shevlyakov G.,Saint Petersburg State Polytechnic University | Lyubomishchenko N.,HP Labs Russia | Mehraz P.,Inlogy Inc. | Polutin V.,HP Labs Russia
HP Laboratories Technical Report

The problems of comparing taxonomy evaluation criteria and creating corresponding benchmarks are considered. The classes of Primitive Ideal Taxonomies (PITs), together with their WordNet and disrupted versions, are proposed as sets of benchmark taxonomies for comparing taxonomy evaluation methods. For WordNet PITs and their perturbations, the performance of the structure-based PageRank and FloorRank criteria and the corpus-based Information Content criterion is studied in a Monte Carlo experiment. It is shown that the proposed approach can be used to rank taxonomy evaluation criteria. © Copyright WeBS 2010.

Kiseleva J.,HP Labs Russia | Simanovsky A.,HP Labs Russia
HP Laboratories Technical Report

We describe the results of experiments on extracting synonyms from the query log of a large commercial site search engine. Our primary focus is product search queries. The resulting dictionary of synonyms can be plugged into a search engine to improve the quality of search results. We use a product database to extend the dictionary. © Copyright 2011 Hewlett-Packard Development Company.