Astudillo R.F., Spoken Language Systems Laboratory | Gerkmann T., University of Oldenburg
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2013

The Gaussian distortion model in the short-time Fourier transform (STFT) domain underlies many modern speech enhancement algorithms, in part because additive sources and late reverberation can be analyzed and processed efficiently in this domain. The STFT domain is, however, only loosely related to acoustic quality and is also poorly suited for learning models, owing to the high variability of speech in this domain. The cepstral domain, on the other hand, has proved very well suited for both of these purposes, although at the cost of losing the simple linear relation between the desired source and additive interferences. In this paper we explore the relation between the Gaussian distortion models in the STFT and cepstral domains. We show how the assumption of a jointly Gaussian distortion model in the cepstral domain is fulfilled for well-known distortion models in the STFT domain. We provide closed-form solutions relating the joint distributions of corrupted and clean speech in the STFT and cepstral domains, and we propose various ways in which this model can be used to enhance speech. © 2013 IEEE.
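As a concrete (hypothetical) illustration of the additive Gaussian distortion model in the STFT domain, the sketch below assumes per-bin circular complex Gaussian priors Y = X + N with invented variances; under this model the posterior of the clean coefficient given the noisy one is again complex Gaussian, with a Wiener-filter mean. This is a minimal sketch of the standard model, not the paper's cepstral-domain derivation.

```python
import numpy as np

# Additive distortion model in one time-frequency bin: Y = X + N with
# X ~ CN(0, lambda_x) (speech) and N ~ CN(0, lambda_n) (noise), independent.
# The posterior p(X|Y) is complex Gaussian:
#   E[X|Y] = G * Y,  Var[X|Y] = G * lambda_n,  G = lambda_x / (lambda_x + lambda_n)
def wiener_posterior(y, lambda_x, lambda_n):
    g = lambda_x / (lambda_x + lambda_n)  # Wiener gain
    return g * y, g * lambda_n            # posterior mean and variance

# Illustrative draw of one noisy STFT coefficient (variances are made up).
rng = np.random.default_rng(0)
lambda_x, lambda_n = 2.0, 0.5
x = rng.normal(scale=np.sqrt(lambda_x / 2)) + 1j * rng.normal(scale=np.sqrt(lambda_x / 2))
n = rng.normal(scale=np.sqrt(lambda_n / 2)) + 1j * rng.normal(scale=np.sqrt(lambda_n / 2))
y = x + n
mean, var = wiener_posterior(y, lambda_x, lambda_n)
```

Since the gain G is strictly below one, the posterior mean always shrinks the noisy observation toward zero, more strongly at low a priori SNR.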

Astudillo R.F., Spoken Language Systems Laboratory | Orglmeister R., TU Berlin
IEEE Transactions on Audio, Speech and Language Processing | Year: 2013

In this paper we demonstrate how uncertainty propagation allows the computation of minimum mean square error (MMSE) estimates in the feature domain for various feature extraction methods, using short-time Fourier transform (STFT) domain distortion models. In addition, a measure of estimate reliability is attained, which allows either feature re-estimation or the dynamic compensation of automatic speech recognition (ASR) models. The proposed method transforms the posterior distribution associated with a Wiener filter through the feature extraction, using the STFT uncertainty propagation formulas. It is also shown that non-linear estimators in the STFT domain, such as the Ephraim-Malah filters, can be seen as special cases of a propagation of the Wiener posterior. The method is illustrated by developing two MMSE Mel-frequency cepstral coefficient (MFCC) estimators and combining them with observation uncertainty techniques. We discuss similarities with other MMSE-MFCC estimators and show how the proposed approach outperforms conventional MMSE estimators in the STFT domain on the AURORA4 robust ASR task. © 2006-2012 IEEE.
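The core idea of propagating a posterior through a nonlinear feature extraction can be sketched in one dimension. The snippet below is a hypothetical illustration, not the paper's STFT-UP formulas: it pushes a scalar Gaussian posterior through a nonlinearity (here log, standing in for the log-Mel stage of MFCC extraction) using three-point Gauss-Hermite quadrature, yielding both a feature-domain mean (the MMSE estimate) and a variance (the estimate's reliability).

```python
import math

# Propagate a scalar Gaussian posterior N(mu, var) through a nonlinearity f
# using 3-point Gauss-Hermite quadrature for the standard normal:
# nodes mu, mu +/- sqrt(3)*sigma with weights 2/3, 1/6, 1/6
# (exact for polynomials up to degree 5).
def propagate(mu, var, f):
    s = math.sqrt(var)
    nodes = [(mu, 2.0 / 3.0),
             (mu - s * math.sqrt(3.0), 1.0 / 6.0),
             (mu + s * math.sqrt(3.0), 1.0 / 6.0)]
    mean = sum(w * f(x) for x, w in nodes)            # E[f(X)]
    second = sum(w * f(x) ** 2 for x, w in nodes)     # E[f(X)^2]
    return mean, second - mean ** 2                   # mean and variance of f(X)

# Example: a Wiener posterior on a (hypothetical) Mel-band power,
# propagated through the log compression of MFCC extraction.
log_mean, log_var = propagate(1.0, 0.04, math.log)
```

The feature-domain variance returned here plays the role of the "measure of estimate reliability" mentioned above, usable for dynamic model compensation.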

Raposo F., University of Lisbon | Ribeiro R., Instituto Universitário de Lisboa (ISCTE-IUL) | Ribeiro R., Spoken Language Systems Laboratory | Martins de Matos D., University of Lisbon
IEEE/ACM Transactions on Audio Speech and Language Processing | Year: 2016

In order to satisfy processing time constraints, many music information retrieval (MIR) tasks process only a segment of the whole music signal. This may degrade performance, as the most important information for a task may not lie in the processed segment. We leverage generic summarization algorithms, previously applied to text and speech, to summarize items in music datasets. These algorithms build concise and diverse summaries by selecting appropriate segments from the input signal, which also makes them good candidates for summarizing music. We evaluate the summarization process on binary and multiclass music genre classification tasks, comparing the accuracy obtained with summarized datasets against that obtained with human-oriented summaries, with continuous segments (the traditional method for addressing the time constraints mentioned above), and with the full songs of the original dataset. We show that GRASSHOPPER, LexRank, LSA, MMR, and a support-sets-based centrality model improve classification performance when compared to the selected baselines. We also show that summarized datasets yield classification performance that is not statistically significantly different from using full songs. Furthermore, we argue for the advantages of sharing summarized datasets for future MIR research. © 2014 IEEE.
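Of the algorithms named above, Maximal Marginal Relevance (MMR) is perhaps the simplest to sketch. The toy implementation below is a hypothetical illustration, not the paper's setup: it greedily picks k segment feature vectors that are relevant to the signal's centroid while penalizing redundancy with segments already selected, which is what makes the resulting summary both concise and diverse.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select(segments, k, lam=0.5):
    # Greedy MMR: trade off relevance to the whole-signal centroid
    # against redundancy with already-selected segments.
    centroid = segments.mean(axis=0)
    selected, remaining = [], list(range(len(segments)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(segments[i], centroid)
            red = max((cosine(segments[i], segments[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate segments and one distinct segment, MMR picks one duplicate and the distinct segment rather than both duplicates, illustrating the diversity term at work.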

Astudillo R.F., Spoken Language Systems Laboratory
IEEE Signal Processing Letters | Year: 2013

Feature compensation is a low-computational-cost technique for achieving robust automatic speech recognition (ASR). Short-time Fourier transform uncertainty propagation (STFT-UP) provides feature compensation in domains used for ASR, such as the Mel-frequency cepstral coefficient (MFCC) domain, while using STFT-domain distortion models. However, STFT-UP is limited to Gaussian priors when modeling speech distortion, whereas super-Gaussian priors are known to provide improved performance. In this letter, an extension of STFT-UP is presented that uses approximate super-Gaussian priors. This is achieved by extending the conventional complex Gaussian priors to complex Gaussian mixture priors. The approach can be applied to any of the existing STFT-UP solutions, thus providing super-Gaussian uncertainty propagation. The method is exemplified with a minimum mean square error (MMSE) MFCC estimator that uses an approximate generalized Gamma speech prior. This estimator clearly outperforms the Gaussian-based MMSE-MFCC feature compensation on the AURORA4 corpus. © 1994-2012 IEEE.
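The key mechanism, replacing a single complex Gaussian speech prior with a mixture of them, can be sketched for a single STFT bin. The snippet below is a hypothetical illustration with invented variances, not the letter's estimator: under a zero-mean complex Gaussian mixture prior and additive Gaussian noise, the MMSE estimate is a responsibility-weighted combination of per-component Wiener estimates.

```python
import numpy as np

def gmm_mmse(y, weights, lambdas_x, lambda_n):
    # MMSE estimate of a clean STFT coefficient under a zero-mean complex
    # Gaussian mixture speech prior (components with variances lambdas_x,
    # mixture weights `weights`) and additive noise N ~ CN(0, lambda_n).
    y = complex(y)
    # Evidence of y under component k: CN(0, lambda_x[k] + lambda_n).
    evid = np.array([w * np.exp(-abs(y) ** 2 / (lx + lambda_n)) / (np.pi * (lx + lambda_n))
                     for w, lx in zip(weights, lambdas_x)])
    post = evid / evid.sum()                                   # responsibilities
    gains = np.array([lx / (lx + lambda_n) for lx in lambdas_x])  # Wiener gains
    return complex((post * gains).sum() * y)                   # mixture of Wiener means

# A two-component mixture (one low-variance, one heavy component) mimics
# the heavier tails of a super-Gaussian prior; the values are illustrative.
est = gmm_mmse(0.5 + 0.5j, [0.8, 0.2], [0.3, 3.0], 0.5)
```

With a single component, the estimator collapses back to the conventional Wiener filter, which is the Gaussian special case the letter extends.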

Saeidi R., Aalto University | Astudillo R.F., Spoken Language Systems Laboratory | Kolossa D., Ruhr University Bochum
IEEE Transactions on Pattern Analysis and Machine Intelligence | Year: 2016

Linear discriminant analysis (LDA) is a powerful technique in pattern recognition for reducing the dimensionality of data vectors. It maximizes discriminability by retaining only those directions that maximize the ratio of between-class to within-class variance. In this paper, using the same principles as conventional LDA, we propose to employ uncertainties of the noisy or distorted input data in order to estimate maximally discriminant directions. We demonstrate the efficiency of the proposed uncertain LDA on two applications using state-of-the-art techniques. First, we experiment with an automatic speech recognition task, in which the uncertainty of the observations is imposed by real-world additive noise. Next, we examine a full-scale speaker recognition system, considering the utterance duration as the source of uncertainty in authenticating a speaker. The experimental results show that, when employing an appropriate uncertainty estimation algorithm, uncertain LDA outperforms its conventional LDA counterpart. © 2015 IEEE.
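A two-class sketch makes the idea concrete. The code below is a hypothetical reading of the approach, not the paper's exact formulation: it computes the closed-form two-class LDA direction, and optionally inflates the within-class scatter with each observation's uncertainty covariance so that unreliable samples contribute less to the estimated discriminant direction.

```python
import numpy as np

def lda_direction(X0, X1, uncertainties=None):
    # Two-class LDA: w is proportional to Sw^{-1} (m1 - m0), where Sw is the
    # within-class scatter. If per-observation uncertainty covariances are
    # given, add them to Sw (an "uncertain LDA"-style modification; the
    # paper's formulation may differ).
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (np.cov(X0.T, bias=True) * len(X0)
          + np.cov(X1.T, bias=True) * len(X1))
    if uncertainties is not None:
        Sw = Sw + sum(uncertainties)        # sum of per-sample covariances
    w = np.linalg.solve(Sw, m1 - m0)        # closed-form discriminant direction
    return w / np.linalg.norm(w)
```

For two classes separated along the first axis, the recovered direction aligns with that axis, with or without the extra uncertainty term.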
