Raposo F.,University of Lisbon | Ribeiro R.,Instituto Universitário de Lisboa (ISCTE-IUL) | Ribeiro R.,Spoken Language Systems Laboratory | Martins De Matos D.,University of Lisbon
IEEE/ACM Transactions on Audio Speech and Language Processing | Year: 2016

In order to satisfy processing-time constraints, many music information retrieval (MIR) tasks process only a segment of the whole music signal. This may degrade performance, since the information most important to the task may not lie in the processed segment. We leverage generic summarization algorithms, previously applied to text and speech, to summarize items in music datasets. These algorithms build concise and diverse summaries by selecting appropriate segments from the input signal, which also makes them good candidates for summarizing music. We evaluate the summarization process on binary and multiclass music genre classification tasks, comparing the accuracy obtained with summarized datasets against the accuracy obtained with human-oriented summaries, with continuous segments (the traditional method for addressing the aforementioned time constraints), and with the full songs of the original dataset. We show that GRASSHOPPER, LexRank, LSA, MMR, and a support-sets-based centrality model improve classification performance over the selected baselines. We also show that the classification performance obtained with summarized datasets does not differ statistically significantly from that obtained with full songs. Furthermore, we argue for the advantages of sharing summarized datasets in future MIR research. © 2014 IEEE.
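As an illustration of how such selection works, here is a minimal sketch of MMR-style segment selection over precomputed segment features. The feature choice (per-segment averaged MFCCs), the cosine similarity, and the centroid standing in for the MMR "query" are assumptions of this sketch, not the paper's exact setup.

```python
import numpy as np

def mmr_select(segments, n_select, lam=0.7):
    """Select a concise yet diverse subset of segments via Maximal
    Marginal Relevance (MMR). `segments` is an (n, d) array of feature
    vectors, e.g. per-segment averaged MFCCs (an assumption of this
    sketch, not necessarily the paper's features)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    centroid = segments.mean(axis=0)  # stands in for the MMR "query"
    selected = []
    remaining = list(range(len(segments)))
    while remaining and len(selected) < n_select:
        def mmr_score(i):
            relevance = cos(segments[i], centroid)
            redundancy = max((cos(segments[i], segments[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of segments forming the summary
```

The `lam` parameter trades relevance against redundancy, which is what keeps the resulting summary both concise and diverse.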


Neto J.,Spoken Language Systems Laboratory | Neto J.,University of Lisbon | Meinedo H.,Spoken Language Systems Laboratory | Viveiros M.,Speech Processing Technologies SA
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2011

Large amounts of information in the form of video and audio are not searchable. At a time when business intelligence is fundamental to all areas, performing this kind of analysis only on text sources is a limiting factor. The use of large-vocabulary speech recognition systems of ever-increasing performance is giving rise to a range of applications. Despite their diversity, these applications share the extensive use of the contents of the transcription. In this paper we describe the results of a development project between a startup company and a research lab to build a fully automatic system for monitoring TV and radio channels. The system is composed of three main blocks: a recording block (records the selected channels in high and web-streaming quality and passes them to the next block), a processing block (generates metadata), and a storage and access block (makes the metadata and videos available). An optional delivery block can be customized according to use and client needs. The processing block receives the video from the recording block, time-filtered by a scheduling interface, and its goal is to generate a full annotation with metadata describing the content and semantic information. This metadata makes possible a filtering process for the selective dissemination of information. © 2011 IEEE.
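A minimal sketch of the dataflow between the blocks, with hypothetical type and function names (the stubs stand in for the LVCSR engine and the semantic annotation; none of this is the project's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Recording:                       # output of the recording block
    channel: str
    video_uri: str                     # high-quality and web-streaming renditions
    start: str
    end: str

@dataclass
class AnnotatedRecording:              # output of the processing block
    recording: Recording
    transcript: str
    metadata: dict = field(default_factory=dict)

def asr_transcribe(uri: str) -> str:
    """Placeholder for the large-vocabulary speech recognizer."""
    return ""

def extract_metadata(transcript: str) -> dict:
    """Placeholder for the semantic annotation of the transcript."""
    return {"topics": [], "named_entities": []}

def processing_block(rec: Recording) -> AnnotatedRecording:
    """Annotate one time-filtered recording with content metadata."""
    transcript = asr_transcribe(rec.video_uri)
    return AnnotatedRecording(rec, transcript, extract_metadata(transcript))
```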


Graca J.V.,Spoken Language Systems Laboratory | Ganchev K.,University of Pennsylvania | Taskar B.,University of Pennsylvania
Computational Linguistics | Year: 2010

Word-level alignment of bilingual text is a critical resource for a growing variety of tasks. Probabilistic models for word alignment present a fundamental trade-off between the richness of captured constraints and correlations and the efficiency and tractability of inference. In this article, we use the Posterior Regularization framework (Graça, Ganchev, and Taskar 2007) to incorporate complex constraints into probabilistic models during learning without changing the efficiency of the underlying model. We focus on the simple and tractable hidden Markov model, and present an efficient learning algorithm for incorporating approximate bijectivity and symmetry constraints. Models estimated with these constraints produce a significant boost in performance as measured by both precision and recall against manually annotated alignments for six language pairs. We also report experiments on two different tasks where word alignments are required, phrase-based machine translation and syntax transfer, and show promising improvements over standard methods. © 2010 Association for Computational Linguistics.
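The heart of the framework can be restated compactly (generic notation, my paraphrase): each E-step projects the model posterior onto a constraint set defined by expectations of constraint features. For approximate bijectivity, for instance, f counts how often each target word is aligned and b bounds that expectation by one.

```latex
% Posterior Regularization E-step (generic restatement): project the
% model posterior onto the constraint set Q before re-estimating theta.
\begin{aligned}
q^{*} &= \arg\min_{q \in Q}\;
  \mathrm{KL}\!\left(q(\mathbf{z}) \,\middle\|\, p_{\theta}(\mathbf{z} \mid \mathbf{x})\right),\\
Q &= \left\{\, q \;:\; \mathbb{E}_{q}\!\left[\mathbf{f}(\mathbf{x},\mathbf{z})\right] \le \mathbf{b} \,\right\}.
\end{aligned}
```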


Nickel R.M.,Bucknell University | Astudillo R.F.,Spoken Language Systems Laboratory | Kolossa D.,Ruhr University Bochum | Martin R.,Ruhr University Bochum
IEEE Transactions on Audio, Speech and Language Processing | Year: 2013

We present a new approach for corpus-based speech enhancement that significantly improves over a method published by Xiao and Nickel in 2010. Corpus-based enhancement systems do not merely filter an incoming noisy signal, but resynthesize its speech content via an inventory of pre-recorded clean signals. The goal of the procedure is to perceptually improve the sound of speech signals in background noise. The proposed method modifies Xiao and Nickel's method in four significant ways. Firstly, it employs a Gaussian mixture model (GMM) instead of a vector quantizer in the phoneme recognition front-end. Secondly, the state decoding of the recognition stage is supported with an uncertainty modeling technique. With the GMM and the uncertainty modeling it is possible to eliminate the need for noise-dependent system training. Thirdly, the post-processing of the original method via sinusoidal modeling is replaced with a powerful cepstral smoothing operation. Lastly, thanks to these modifications, it is possible to extend the operational bandwidth of the procedure from 4 kHz to 8 kHz. The performance of the proposed method was evaluated across different noise types and different signal-to-noise ratios. The new method significantly outperformed traditional methods, including the one by Xiao and Nickel, in terms of PESQ scores and other objective quality measures. Results of subjective CMOS tests over a smaller set of test samples support our claims. © 2006-2012 IEEE.
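To give an idea of the cepstral smoothing step that replaces the sinusoidal post-processing, here is a rough sketch of smoothing a spectral gain in the cepstral domain with quefrency-dependent time constants. The constants and the first-order recursion are assumptions of the sketch; the published smoothers are more elaborate.

```python
import numpy as np

def cepstral_smooth_gains(gains, beta_env=0.2, beta_fine=0.8, q_cut=16):
    """Temporally smooth spectral gains in the cepstral domain (a rough
    sketch; constants and recursion are assumptions, not the paper's
    exact smoother). `gains`: (n_frames, n_bins) Wiener-type gains."""
    smoothed = np.empty_like(gains)
    prev = None
    for t, g in enumerate(np.maximum(gains, 1e-8)):
        ceps = np.fft.irfft(np.log(g))        # real cepstrum of the gain curve
        beta = np.full_like(ceps, beta_fine)  # smooth fine structure strongly...
        beta[:q_cut] = beta_env               # ...keep the envelope agile
        beta[-(q_cut - 1):] = beta_env        # mirror (cepstrum is symmetric)
        if prev is not None:
            ceps = beta * prev + (1.0 - beta) * ceps
        prev = ceps
        smoothed[t] = np.exp(np.fft.rfft(ceps).real)
    return smoothed
```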


Saeidi R.,Aalto University | Astudillo R.F.,Spoken Language Systems Laboratory | Kolossa D.,Ruhr University Bochum
IEEE Transactions on Pattern Analysis and Machine Intelligence | Year: 2016

Linear discriminant analysis (LDA) is a powerful technique in pattern recognition for reducing the dimensionality of data vectors. It maximizes discriminability by retaining only those directions that minimize the ratio of within-class to between-class variance. In this paper, using the same principles as conventional LDA, we propose to employ the uncertainties of the noisy or distorted input data in order to estimate maximally discriminant directions. We demonstrate the efficiency of the proposed uncertain LDA on two applications using state-of-the-art techniques. First, we experiment with an automatic speech recognition task, in which the uncertainty of the observations is imposed by real-world additive noise. Next, we examine a full-scale speaker recognition system, considering the utterance duration as the source of uncertainty in authenticating a speaker. The experimental results show that, when employing an appropriate uncertainty estimation algorithm, uncertain LDA outperforms its conventional LDA counterpart. © 2015 IEEE.
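A minimal numpy sketch of the idea as stated, folding per-sample uncertainty covariances into the within-class scatter before solving the usual generalized eigenproblem. This illustrates the principle only and is not the paper's exact estimator.

```python
import numpy as np
from scipy.linalg import eigh

def uncertain_lda(X, y, Sigma, n_components):
    """LDA directions when each observation carries an uncertainty
    covariance (a sketch of the principle, not the paper's estimator).
    X: (n, d) features, y: (n,) labels, Sigma: (n, d, d) per-sample
    uncertainty covariances."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc, Sc = X[y == c], Sigma[y == c]
        mc = Xc.mean(axis=0)
        diff = Xc - mc
        # within-class scatter plus the expected scatter due to uncertainty
        Sw += diff.T @ diff + Sc.sum(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # generalized eigenproblem Sb w = lambda Sw w; keep leading directions
    vals, vecs = eigh(Sb, Sw + 1e-9 * np.eye(d))
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order]
```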


Nickel R.M.,Bucknell University | Astudillo R.F.,Spoken Language Systems Laboratory | Kolossa D.,Ruhr University Bochum | Zeiler S.,Ruhr University Bochum | Martin R.,Ruhr University Bochum
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2012

We present a new method for inventory-style speech enhancement that significantly improves over earlier approaches [1]. Inventory-style enhancement attempts to resynthesize a clean speech signal from a noisy signal via corpus-based speech synthesis. The advantage of such an approach is that one is not bound to trade noise suppression against signal distortion in the way that most traditional methods are. A significant improvement in perceptual quality is typically the result. Disadvantages of this approach, however, include speaker dependency, increased processing delays, and the need for substantial system training. Earlier published methods relied on a priori knowledge of the expected noise type during the training process [1]. In this paper we present a new method that exploits uncertainty-of-observation techniques to circumvent the need for noise-specific training. Experimental results show that the new method is not only able to match, but to outperform, the earlier approaches in perceptual quality. © 2012 IEEE.
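As a rough illustration of inventory-style resynthesis, the following greedy sketch picks, for each noisy frame, the clean inventory unit that balances a target cost against a concatenation cost. All names and costs here are illustrative assumptions; the actual systems use trained recognizers and a proper search.

```python
import numpy as np

def resynthesize(noisy_feats, inv_feats, inv_units, w_concat=0.5):
    """Greedy inventory-style resynthesis: for each noisy frame pick the
    clean unit balancing a target cost (feature distance) against a
    concatenation cost (distance to the previous pick). Illustrative
    only. inv_units: list of 1-D clean waveform snippets."""
    picks, prev = [], None
    for f in noisy_feats:
        target = np.linalg.norm(inv_feats - f, axis=1)
        concat = (0.0 if prev is None
                  else np.linalg.norm(inv_feats - inv_feats[prev], axis=1))
        prev = int(np.argmin(target + w_concat * concat))
        picks.append(inv_units[prev])
    return np.concatenate(picks)
```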


Kolossa D.,Ruhr University Bochum | Zeiler S.,Ruhr University Bochum | Saeidi R.,Radboud University Nijmegen | Astudillo R.F.,Spoken Language Systems Laboratory
IEEE Signal Processing Letters | Year: 2013

Automatic speech recognition (ASR) performance suffers severely from non-stationary noise, precluding the widespread use of ASR in natural environments. Recently, so-called uncertainty-of-observation techniques have helped to recover good performance. These techniques consider the clean speech features as a hidden variable, of which the observable features are only an imperfect estimate, and an estimated error variance of the features is used to further guide recognition. Based on the same idea, we introduce a new strategy: reducing the speech feature dimensionality for optimal discriminability under observation uncertainty can yield significantly improved recognition performance, and is derived easily via Fisher's criterion of discriminant analysis. © 1994-2012 IEEE.
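For reference, the standard observation-uncertainty likelihood that this family of techniques builds on, in generic notation (my restatement): marginalizing the hidden clean feature inflates each Gaussian state covariance by the estimated error covariance.

```latex
% Observation uncertainty: the clean feature x is hidden; only an
% estimate \hat{x} with error covariance \Sigma_e is observed.
% Marginalizing x in a Gaussian HMM state q inflates the state
% covariance by the error covariance:
p(\hat{\mathbf{x}} \mid q)
  = \int \mathcal{N}\!\left(\mathbf{x}; \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q\right)
    \mathcal{N}\!\left(\hat{\mathbf{x}}; \mathbf{x}, \boldsymbol{\Sigma}_e\right) d\mathbf{x}
  = \mathcal{N}\!\left(\hat{\mathbf{x}}; \boldsymbol{\mu}_q,
      \boldsymbol{\Sigma}_q + \boldsymbol{\Sigma}_e\right).
```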


Astudillo R.F.,Spoken Language Systems Laboratory | Gerkmann T.,University of Oldenburg
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2013

The Gaussian distortion model in the short-time Fourier transform (STFT) domain is the basis of many modern speech enhancement algorithms. One of the reasons is that additive sources and late reverberation can be analyzed and processed quite efficiently in this domain. The STFT domain is, however, not closely related to acoustic quality, and it is also not well suited for learning models, owing to the high variability of speech in this domain. The cepstral domain, on the other hand, has proved to be very well suited for these last two purposes, albeit at the cost of losing the simple linear relation between the desired source and additive interferences. In this paper we explore the relation between the Gaussian distortion models in the STFT and cepstral domains. We show how the assumption of a jointly Gaussian distortion model in the cepstral domain is fulfilled for well-known distortion models in the STFT domain. We provide closed-form solutions relating the joint distributions of corrupted and clean speech in the STFT and cepstral domains. We also propose various ways in which this model can be used to enhance speech. © 2013 IEEE.
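One concrete ingredient of this relation (a sketch in generic notation, not the paper's full derivation): the cepstrum is a linear DCT transform of the log-spectrum, so joint Gaussianity in the log-spectral domain carries over to the cepstral domain.

```latex
% The cepstrum c is a linear (DCT) transform of the log-spectrum, so a
% Gaussian model in the log-spectral domain induces a Gaussian one in
% the cepstral domain:
\mathbf{c} = \mathbf{D}\,\log\lvert\mathbf{X}\rvert, \qquad
\log\lvert\mathbf{X}\rvert \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})
\;\Longrightarrow\;
\mathbf{c} \sim \mathcal{N}\!\left(\mathbf{D}\boldsymbol{\mu},\,
  \mathbf{D}\boldsymbol{\Sigma}\mathbf{D}^{\top}\right).
```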


Astudillo R.F.,Spoken Language Systems Laboratory | Orglmeister R.,TU Berlin
IEEE Transactions on Audio, Speech and Language Processing | Year: 2013

In this paper we demonstrate how uncertainty propagation allows the computation of minimum mean square error (MMSE) estimates in the feature domain for various feature extraction methods using short-time Fourier transform (STFT) domain distortion models. In addition, a measure of estimate reliability is attained, which allows either feature re-estimation or the dynamic compensation of automatic speech recognition (ASR) models. The proposed method transforms the posterior distribution associated with a Wiener filter through the feature extraction using the STFT uncertainty propagation formulas. It is also shown that non-linear estimators in the STFT domain, such as the Ephraim-Malah filters, can be seen as special cases of a propagation of the Wiener posterior. The method is illustrated by developing two MMSE Mel-frequency cepstral coefficient (MFCC) estimators and combining them with observation uncertainty techniques. We discuss similarities with other MMSE-MFCC estimators and show how the proposed approach outperforms conventional MMSE estimators in the STFT domain on the AURORA4 robust ASR task. © 2006-2012 IEEE.
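A sketch of the propagation idea: starting from a per-bin Wiener posterior (complex Gaussian with mean x̂ and variance λ), the moments of |X|² pass exactly through the linear Mel filterbank, a log-normal assumption handles the logarithm, and the DCT is again linear. The log-normal step and the independent-bin variances are assumptions of this sketch; the paper's STFT-UP formulas differ in detail.

```python
import numpy as np

def mmse_mfcc_from_wiener_posterior(x_hat, lam, mel_fb, n_ceps=13):
    """Propagate a per-bin Wiener posterior (complex Gaussian, mean
    x_hat, variance lam) through Mel filterbank, log and DCT to obtain
    an MMSE-MFCC estimate plus its uncertainty (sketch only).
    mel_fb: (n_mels, n_bins) filterbank matrix."""
    # moments of |X|^2 under a complex Gaussian posterior
    e_pow = np.abs(x_hat) ** 2 + lam
    v_pow = lam * (lam + 2.0 * np.abs(x_hat) ** 2)
    # the Mel filterbank is linear, so the moments propagate exactly
    # (variance assumes independent bins)
    e_mel = mel_fb @ e_pow
    v_mel = (mel_fb ** 2) @ v_pow
    # log-normal approximation for the logarithm of the Mel energies
    r = 1.0 + v_mel / np.maximum(e_mel ** 2, 1e-12)
    e_log = np.log(np.maximum(e_mel, 1e-12)) - 0.5 * np.log(r)
    v_log = np.log(r)
    # orthonormal DCT-II matrix; linear, so the mean propagates exactly
    N = e_log.shape[0]
    k = np.arange(n_ceps)[:, None]
    n = np.arange(N)[None, :]
    D = np.sqrt(2.0 / N) * np.cos(np.pi * (n + 0.5) * k / N)
    D[0] /= np.sqrt(2.0)
    return D @ e_log, (D ** 2) @ v_log  # MFCC mean and variance
```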


Astudillo R.F.,Spoken Language Systems Laboratory
IEEE Signal Processing Letters | Year: 2013

Feature compensation is a low-computational-cost technique for achieving robust automatic speech recognition (ASR). Short-Time Fourier Transform uncertainty propagation (STFT-UP) provides feature compensation in domains used for ASR, such as the Mel-frequency cepstral coefficient (MFCC) domain, while using STFT-domain distortion models. However, STFT-UP is limited to Gaussian priors when modeling speech distortion, whereas super-Gaussian priors are known to provide improved performance. In this letter, an extension of STFT-UP is presented that uses approximate super-Gaussian priors. This is achieved by extending the conventional complex Gaussian priors to complex Gaussian mixture priors. The approach can be applied to any of the existing STFT-UP solutions, thus providing super-Gaussian uncertainty propagation. The method is exemplified by a minimum mean square error (MMSE) MFCC estimator with an approximate generalized Gamma speech prior. This estimator clearly outperforms Gaussian-based MMSE-MFCC feature compensation on the AURORA4 corpus. © 1994-2012 IEEE.
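The mechanism can be summarized with the standard closed-form MMSE estimate under a zero-mean complex Gaussian mixture speech prior: a posterior-weighted sum of per-component Wiener estimates. This is a generic restatement; the letter's mixture is fitted to approximate a generalized Gamma prior.

```latex
% MMSE estimate under a complex Gaussian mixture prior on the clean
% STFT coefficient X, for an observation Y = X + N with noise variance
% \sigma_N^2:
p(X) = \sum_{k} w_k\, \mathcal{CN}\!\left(X; 0, \sigma_k^2\right)
\;\Longrightarrow\;
\hat{X}_{\mathrm{MMSE}} = \mathbb{E}[X \mid Y]
  = \sum_{k} P(k \mid Y)\, \frac{\sigma_k^2}{\sigma_k^2 + \sigma_N^2}\, Y,
\qquad
P(k \mid Y) = \frac{w_k\, \mathcal{CN}\!\left(Y; 0, \sigma_k^2 + \sigma_N^2\right)}
                   {\sum_{j} w_j\, \mathcal{CN}\!\left(Y; 0, \sigma_j^2 + \sigma_N^2\right)}.
```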
