Speech and Language Processing Unit

Cambridge, MA, United States


Ananthakrishnan S., Speech and Language Processing Unit | Prasad R., Speech and Language Processing Unit | Natarajan P., Speech and Language Processing Unit
Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010 | Year: 2010

The performance of phrase-based statistical machine translation (SMT) systems is crucially dependent on the quality of the extracted phrase pairs, which is in turn a function of word alignment quality. Data sparsity, an inherent problem in SMT even with large training corpora, often has an adverse impact on the reliability of the extracted phrase translation pairs. In this paper, we present a novel feature based on bootstrap resampling of the training corpus, termed phrase alignment confidence, that measures the goodness of a phrase translation pair. We integrate this feature within a phrase-based SMT system and show an improvement of 1.7% BLEU and 4.4% METEOR over a baseline English-to-Pashto (E2P) SMT system that does not use any measure of phrase pair quality. We then show that the proposed measure compares well to an existing indicator of phrase pair reliability, the lexical smoothing probability. We also demonstrate that combining the two measures leads to a further improvement of 0.4% BLEU and 0.3% METEOR on the E2P system. Commensurate translation improvements are obtained on automatic speech recognition (ASR) transcripts of the source speech utterances. © 2010 ISCA.
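The following is a minimal sketch of the bootstrap-resampling idea described above, assuming a hypothetical extract_phrase_pairs() helper that wraps word alignment and phrase extraction; the paper's exact scoring formulation is not reproduced here, only the general resampling-and-counting scheme.

```python
# Sketch: estimate a phrase alignment confidence for each phrase pair as the
# fraction of bootstrap replicates of the training corpus in which that pair
# is extracted. extract_phrase_pairs() is a placeholder (an assumption), not
# part of the paper.
import random
from collections import Counter

def extract_phrase_pairs(parallel_corpus):
    """Placeholder: align the corpus and return a set of (src, tgt) phrase pairs.
    In practice this would wrap a standard alignment/extraction pipeline."""
    raise NotImplementedError

def phrase_alignment_confidence(parallel_corpus, num_samples=100, seed=0):
    """Return {phrase_pair: fraction of bootstrap replicates extracting it}."""
    rng = random.Random(seed)
    counts = Counter()
    n = len(parallel_corpus)
    for _ in range(num_samples):
        # Resample sentence pairs with replacement to form one bootstrap replicate.
        replicate = [parallel_corpus[rng.randrange(n)] for _ in range(n)]
        for pair in extract_phrase_pairs(replicate):
            counts[pair] += 1
    # Confidence = relative extraction frequency across replicates.
    return {pair: c / num_samples for pair, c in counts.items()}
```

Pairs that survive extraction under many resamplings of the data receive high confidence, while pairs that depend on a few idiosyncratic sentence pairs receive low confidence; the score can then be added as a feature alongside the usual phrase-table probabilities.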


Ananthakrishnan S., Speech and Language Processing Unit | Prasad R., Speech and Language Processing Unit | Stallard D., Speech and Language Processing Unit | Natarajan P., Speech and Language Processing Unit
Computer Speech and Language | Year: 2013

The development of high-performance statistical machine translation (SMT) systems is contingent on the availability of substantial, in-domain parallel training corpora. The latter, however, are expensive to produce due to the labor-intensive nature of manual translation. We propose to alleviate this problem with a novel, semi-supervised, batch-mode active learning strategy that attempts to maximize in-domain coverage by selecting sentences that represent a balance between domain match, translation difficulty, and batch diversity. Simulation experiments on an English-to-Pashto translation task show that the proposed strategy outperforms not only the random selection baseline but also traditional active selection techniques based on dissimilarity to existing training data. © 2011 Elsevier Ltd. All rights reserved.
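A minimal sketch of the batch selection idea follows: sentences are picked greedily from the unlabeled pool using a weighted combination of domain match, translation difficulty, and diversity with respect to the sentences already selected. The feature functions and weights are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: greedy batch-mode active selection balancing domain match,
# difficulty, and batch diversity. The scoring functions passed in
# (domain_match, difficulty, similarity) are assumptions supplied by the
# caller, e.g. an in-domain LM score, an OOV-based difficulty estimate,
# and an n-gram overlap measure.
from typing import Callable, List

def select_batch(pool: List[str],
                 domain_match: Callable[[str], float],
                 difficulty: Callable[[str], float],
                 similarity: Callable[[str, str], float],
                 batch_size: int,
                 w_dom: float = 1.0, w_diff: float = 1.0, w_div: float = 1.0) -> List[str]:
    batch: List[str] = []
    candidates = list(pool)
    while candidates and len(batch) < batch_size:
        def score(s: str) -> float:
            # Diversity penalty: maximum similarity to anything already selected.
            redundancy = max((similarity(s, b) for b in batch), default=0.0)
            return w_dom * domain_match(s) + w_diff * difficulty(s) - w_div * redundancy
        best = max(candidates, key=score)
        batch.append(best)
        candidates.remove(best)
    return batch
```

The diversity term discourages selecting near-duplicate sentences within a batch, which is what distinguishes batch-mode selection from simply ranking the pool by a single informativeness score.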
