McLaren M.,Speech Technology and Research Laboratory | Ferrer L.,University of Buenos Aires | Lawson A.,Speech Technology and Research Laboratory
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2016

Using bottleneck features extracted from a deep neural network (DNN) trained to predict senone posteriors has resulted in new, state-of-the-art technology for language and speaker identification. For language identification, the features' dense phonetic information is believed to enable improved performance by better representing language-dependent phone distributions. For speaker recognition, the role of these features is less clear, given that a bottleneck layer near the DNN output layer is thought to contain limited speaker information. In this article, we analyze the role of bottleneck features in these identification tasks by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer. Experiments support this hypothesis under certain conditions, and highlight the benefit of using a bottleneck layer close to the DNN output layer when DNN training data is matched to the evaluation conditions, and a layer more central to the DNN otherwise. © 2016 IEEE.
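The core mechanic here is tapping activations from an intermediate layer of a trained senone classifier. Below is a minimal sketch of how that extraction might look in PyTorch using a forward hook; the architecture, layer sizes, and tapped index are hypothetical placeholders, not the network used in the paper.

```python
import torch
import torch.nn as nn

# Stand-in for a trained senone-classifier DNN; all sizes are hypothetical.
dnn = nn.Sequential(
    nn.Linear(440, 1200), nn.ReLU(),   # input: stacked acoustic frames
    nn.Linear(1200, 1200), nn.ReLU(),
    nn.Linear(1200, 80), nn.ReLU(),    # 80-dim bottleneck layer
    nn.Linear(80, 3450),               # output layer: senone logits
)

captured = {}

def save_activations(module, inputs, output):
    # Keep the tapped layer's activations to use as bottleneck features.
    captured["bottleneck"] = output.detach()

# Tap the linear bottleneck (index 4). Moving this tap toward the output
# layer is the knob the paper studies: denser phonetic information at the
# cost of speaker information.
dnn[4].register_forward_hook(save_activations)

frames = torch.randn(100, 440)         # 100 dummy frames of stacked features
with torch.no_grad():
    dnn(frames)

bottleneck_features = captured["bottleneck"]   # shape: (100, 80)
```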


Scheffer N.,Speech Technology and Research Laboratory | Lei Y.,Speech Technology and Research Laboratory
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | Year: 2014

This work tackles the problem of content mismatch in short-duration speaker verification. Experiments are run on both text-dependent and text-independent protocols, where a larger amount of enrollment data is available in the latter. We recently proposed a framework based on a deep neural network that explicitly utilizes phonetic information, and showed increased performance on long-duration utterances. We show how this new framework can also yield significant improvements for short durations. We then propose an innovative approach to perform content matching, i.e., transforming a text-independent trial into a text-dependent one by mining content from a speaker's enrollment data to match the test utterance. We show how content matching can be done effectively at the statistics level, enabling the use of standard verification backends. Experiments, run on the RSR2015 and NIST SRE 2010 data sets, show relative improvements of 50% for cases where the content has been said during enrollment. While no significant improvements were observed for the general text-independent case, we believe this work may pave the way for new research on speaker verification with very short utterances. Copyright © 2014 ISCA.
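As a rough illustration of what statistics-level content matching could look like, the sketch below masks a speaker's enrollment Baum-Welch statistics to the phonetic classes actually observed in the test utterance. This is only one plausible reading of the abstract; the function, shapes, and occupancy threshold are assumptions, not the authors' implementation.

```python
import numpy as np

def match_statistics(enroll_N, enroll_F, test_N, min_occupancy=1.0):
    """Zero out enrollment statistics for phonetic classes that are absent
    from the test utterance, approximating a text-dependent trial built
    from text-independent enrollment data. (Illustrative, not the paper's
    actual method.)"""
    mask = test_N > min_occupancy        # classes actually seen in the test
    N = np.where(mask, enroll_N, 0.0)    # zeroth-order stats, shape (C,)
    F = enroll_F * mask[:, None]         # first-order stats, shape (C, D)
    return N, F

C, D = 3450, 40                          # hypothetical class/feature counts
rng = np.random.default_rng(0)
enroll_N = rng.random(C) * 10.0          # dummy accumulated occupancies
enroll_F = rng.standard_normal((C, D))   # dummy first-order statistics
test_N = rng.random(C) * 2.0

N_matched, F_matched = match_statistics(enroll_N, enroll_F, test_N)
```

The masked statistics keep the same shape as the originals, which is what lets a standard verification backend consume them unchanged.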


Ferrer L.,Speech Technology and Research Laboratory | Ferrer L.,University of Buenos Aires | Ferrer L.,CONICET | Bratt H.,Speech Technology and Research Laboratory | And 4 more authors.
Speech Communication | Year: 2015

We present a system for detection of lexical stress in English words spoken by English learners. This system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. The system uses both prosodic and spectral features to detect the level of stress (unstressed, primary or secondary) for each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt, and duration measurements, as well as log-posterior probabilities obtained from the frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) are used to represent the distribution of these features for each stress class. The system is trained on utterances by L1-English children and tested on English speech from L1-English children and L1-Japanese children with variable levels of English proficiency. Since it is trained on data from L1-English speakers, the system can be used on English utterances spoken by speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended target; therefore, hand-labeling of training data is not required. This allows us to use a large amount of data for training the system. Our algorithm results in an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are necessary to achieve optimal performance on the data from L1-English speakers; MFCC log-posterior probability features are the single best set of features, followed by duration, energy, pitch and, finally, spectral tilt features. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities and duration are the most important features. © 2015 Elsevier B.V.
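A minimal sketch of the classification step described above, assuming syllable-level feature vectors have already been computed: one Gaussian mixture model per stress class, with the decision made by maximum log-likelihood. The feature dimension, mixture size, and training data here are dummies, not the EduSpeak® system's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["unstressed", "primary", "secondary"]

# Dummy syllable-level feature vectors (stand-ins for energy, pitch,
# spectral tilt, duration, and MFCC log-posterior features).
train = {c: rng.normal(loc=i, size=(200, 6)) for i, c in enumerate(classes)}

# One GMM per stress class; the component count is a guess.
models = {}
for c in classes:
    gmm = GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0)
    models[c] = gmm.fit(train[c])

def classify(syllable_features):
    # Pick the stress class whose GMM gives the highest log-likelihood.
    scores = {c: m.score(syllable_features.reshape(1, -1))
              for c, m in models.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(loc=1.0, size=6)))
```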


Ferrer L.,University of Buenos Aires | McLaren M.,Speech Technology and Research Laboratory | Lawson A.,Speech Technology and Research Laboratory | Graciarena M.,Speech Technology and Research Laboratory
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | Year: 2015

We introduce a new dataset for the study of the effect of highly non-stationary noises on language recognition (LR) performance. The dataset is based on the data from the 2009 Language Recognition Evaluation organized by the National Institute of Standards and Technology (NIST). Randomly selected noises are added to these signals to achieve a chosen signal-to-noise ratio and percentage of corruption. We study the effect of these noises on LR performance as a function of these parameters and present some initial methods to mitigate the degradation, focusing on the speech activity detection (SAD) step. These methods include discarding the C0 coefficient from the features used for SAD, using a more stringent threshold on the SAD scores, thresholding the speech likelihoods returned by the model as an additional way of detecting noise, and a final model adaptation step. We show that a system optimized for clean speech is clearly suboptimal on this new dataset since the proposed methods lead to gains of up to 35% on the corrupted data, without knowledge of the test noises and with very little effect on clean data performance. Copyright © 2015 ISCA.
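Two of the SAD modifications lend themselves to a short sketch: discarding the C0 coefficient from the MFCC features and raising the decision threshold on the SAD scores. The scoring function below is a placeholder for trained speech/non-speech models, and the threshold value is illustrative, not the one tuned in the paper.

```python
import numpy as np
import librosa

sr = 8000
signal = np.random.randn(2 * sr).astype(np.float32)        # 2 s of dummy audio

# 13 MFCCs per frame, then discard C0, which mostly tracks raw energy and
# reacts strongly to non-stationary noise.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T  # (frames, 13)
features = mfcc[:, 1:]                                     # drop C0

def sad_scores(feats):
    # Placeholder for log p(x|speech) - log p(x|nonspeech) from trained GMMs.
    return feats.mean(axis=1)

# A threshold raised relative to the clean-speech operating point, so that
# frames must look strongly speech-like to survive.
STRICT_THRESHOLD = 0.5
speech_frames = sad_scores(features) > STRICT_THRESHOLD
print(speech_frames.sum(), "of", len(speech_frames), "frames kept")
```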


Lei Y.,Speech Technology and Research Laboratory | Ferrer L.,Speech Technology and Research Laboratory | Ferrer L.,University of Buenos Aires | McLaren M.,Speech Technology and Research Laboratory | Scheffer N.,Speech Technology and Research Laboratory
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH | Year: 2014

We recently proposed the use of deep neural networks (DNNs) in place of Gaussian mixture models (GMMs) in the i-vector extraction process for speaker recognition. We have shown significant accuracy improvements on the 2012 NIST speaker recognition evaluation (SRE) telephone conditions. This paper explores how this framework can be effectively used on the microphone speech conditions of the 2012 NIST SRE. In this new framework, the verification performance greatly depends on the data used for training the DNN. We show that training the DNN using both telephone and microphone speech data can yield significant improvements. An in-depth analysis of the influence of telephone speech data on the microphone conditions is also presented for both the DNN and GMM systems. We conclude by showing that the GMM system is always outperformed by the DNN system on the telephone-only and microphone-only conditions, and that the new DNN / i-vector framework can be successfully used provided there is a good match in the training data. Copyright © 2014 ISCA.
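The framework's key substitution is easy to sketch: frame-to-class alignments come from the senone DNN's posteriors rather than from GMM-UBM component responsibilities, and the usual zeroth- and first-order sufficient statistics for i-vector extraction are accumulated against those posteriors. Everything below (the softmax stand-in for the DNN, the dimensions) is illustrative.

```python
import numpy as np

def sufficient_stats(posteriors, features):
    """posteriors: (T, C) per-frame class posteriors; features: (T, D) frames.
    Returns zeroth-order stats N (C,) and first-order stats F (C, D)."""
    N = posteriors.sum(axis=0)
    F = posteriors.T @ features
    return N, F

T, C, D = 300, 3450, 40                    # frames, senones, feature dim
rng = np.random.default_rng(0)

# Softmax over random logits stands in for the senone DNN's output; in the
# GMM system these posteriors would instead be UBM component responsibilities.
logits = rng.standard_normal((T, C))
posteriors = np.exp(logits - logits.max(axis=1, keepdims=True))
posteriors /= posteriors.sum(axis=1, keepdims=True)

features = rng.standard_normal((T, D))
N, F = sufficient_stats(posteriors, features)
```

Because only the alignment source changes, the downstream i-vector extractor and verification backend can be reused unchanged.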
