Key Laboratory of Speech Acoustics and Content Understanding, China

Zhang J.,Key Laboratory of Speech Acoustics and Content Understanding | Pan F.,Key Laboratory of Speech Acoustics and Content Understanding | Dong B.,Key Laboratory of Speech Acoustics and Content Understanding | Zhao Q.,Key Laboratory of Speech Acoustics and Content Understanding | Yan Y.,Key Laboratory of Speech Acoustics and Content Understanding
IEICE Transactions on Information and Systems | Year: 2012

This paper presents our investigation into improving the performance of our previous automatic reading quality assessment system. The baseline system computes the average Phone Log-Posterior Probability (PLPP) over all phones in the utterance to be assessed and uses this average as the reading quality assessment feature. In this paper, we present three improvements. First, we cluster the triphones and then compute the average normalized PLPP for each cluster separately, using these averages as a multidimensional assessment feature vector in place of the original one-dimensional feature. This method is simple but effective, reducing the difference between machine and manual scores by 30.2% relative. Second, to assess reading rhythm, we train Gaussian Mixture Models (GMMs) that capture each triphone's relative duration under standard pronunciation. With these GMMs, we compute the probability that the relative duration of each phone conforms to standard pronunciation, and the average of these probabilities is appended to the assessment feature vector as an additional dimension, reducing the score difference between machine and manual scoring by a further 9.7% relative. Third, we detect Filled Pauses (FPs) by analyzing the formant curve, compute the relative duration of the FPs, and append it to the assessment feature vector as another dimension, reducing the score difference by a further 10.2% relative. Finally, when the features extracted by the three methods are used together, the score difference between machine and manual scoring is reduced by 43.9% relative compared to the baseline system. Copyright © 2012 The Institute of Electronics, Information and Communication Engineers.
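A minimal sketch of the first two feature-extraction steps described above: per-cluster averaging of normalized PLPP values and a GMM-based duration score. All names (`assessment_features`, `cluster_of`, `duration_gmms`, the toy triphone labels) are illustrative assumptions, not from the paper; the FP feature is omitted for brevity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def assessment_features(phones, cluster_of, duration_gmms, n_clusters):
    # phones: list of (triphone_id, normalized_plpp, relative_duration)
    # tuples from a forced alignment (shapes/fields are assumptions).
    sums = np.zeros(n_clusters)
    counts = np.zeros(n_clusters)
    dur_scores = []
    for tri, plpp, rel_dur in phones:
        c = cluster_of[tri]
        sums[c] += plpp
        counts[c] += 1
        # Log-likelihood that this phone's relative duration conforms to
        # standard pronunciation, under the per-triphone duration GMM.
        dur_scores.append(duration_gmms[tri].score(np.array([[rel_dur]])))
    # One feature dimension per triphone cluster instead of a single average.
    cluster_means = np.divide(sums, counts, out=np.zeros_like(sums),
                              where=counts > 0)
    # Rhythm feature: average duration log-likelihood over all phones.
    rhythm = float(np.mean(dur_scores)) if dur_scores else 0.0
    return np.append(cluster_means, rhythm)

# Toy usage: two triphones in one cluster, one fitted duration GMM each.
rng = np.random.default_rng(0)
duration_gmms = {
    "a-b+c": GaussianMixture(n_components=1).fit(rng.normal(1.0, 0.1, (50, 1))),
    "b-c+d": GaussianMixture(n_components=1).fit(rng.normal(0.8, 0.1, (50, 1))),
}
phones = [("a-b+c", -1.2, 0.95), ("b-c+d", -0.7, 0.82)]
feats = assessment_features(phones, {"a-b+c": 0, "b-c+d": 0},
                            duration_gmms, n_clusters=4)
```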


Zhang P.,Key Laboratory of Speech Acoustics and Content Understanding | Liu Y.,University of Sheffield | Hain T.,University of Sheffield
2014 IEEE Workshop on Spoken Language Technology, SLT 2014 - Proceedings | Year: 2014

Training acoustic models for ASR requires large amounts of labelled data, which is costly to obtain; hence it is desirable to make use of unlabelled data. While unsupervised training can give gains for standard HMM training, it is more difficult to exploit unlabelled data for discriminative models. This paper explores semi-supervised training of Deep Neural Networks (DNNs) in a meeting recognition task. We first analyse the impact of imperfect transcription on the DNN and on ASR performance. As labelling error is the source of the problem, we investigate two options for reducing it: selecting data with fewer errors, and changing the dependence on noise by reducing label precision. Both confidence-based data selection and label resolution change are explored in two scenarios, with matched and mismatched unlabelled data. We introduce improved DNN-based confidence score estimators and show their performance on data selection in both scenarios. Confidence-based data selection was found to yield up to 14.6% relative WER reduction, while a better balance between label resolution and recognition hypothesis accuracy allowed a further relative WER reduction of 16.6% in the mismatched scenario. © 2014 IEEE.
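A minimal sketch of confidence-based data selection under the assumption that each automatically transcribed utterance carries a recognizer confidence score; the paper's improved DNN-based confidence estimators are not reproduced here, and all record fields and the 0.8 threshold are hypothetical.

```python
# Hypothetical utterance records: recognizer hypothesis plus confidence.
unlabelled_decoded = [
    {"audio": "mtg01_seg3.wav", "hyp": "so the next item", "confidence": 0.91},
    {"audio": "mtg01_seg4.wav", "hyp": "uh we could uh",   "confidence": 0.42},
]

def select_for_training(utterances, threshold=0.8):
    # Keep only hypotheses the recognizer is confident about; low-confidence
    # segments would inject label noise into discriminative DNN training.
    return [u for u in utterances if u["confidence"] >= threshold]

print(select_for_training(unlabelled_decoded))  # keeps only seg3
```

The selected utterances, with their hypotheses treated as labels, would then be pooled with the supervised set for DNN training.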


Liu Y.,University of Sheffield | Zhang P.,Key Laboratory of Speech Acoustics and Content Understanding | Hain T.,University of Sheffield
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2014

This paper presents an investigation of far-field speech recognition using beamforming and channel concatenation in the context of Deep Neural Network (DNN) based feature extraction. While speech enhancement with beamforming is attractive, the algorithms are typically signal-based, with no information about the special properties of speech. A simple alternative to beamforming is to concatenate the features of multiple channels. Results presented in this paper indicate that channel concatenation gives similar or better results. On average, the DNN front-end yields a 25% relative reduction in Word Error Rate (WER). Further experiments aim at including relevant information in the training of adapted DNN features. Augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system and yields additional improvements for far-field speech recognition. © 2014 IEEE.
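A minimal sketch of the channel-concatenation alternative to beamforming: per-channel acoustic feature matrices are stacked along the feature axis and fed to the DNN front-end. The channel count, feature dimensionality, and function name are assumptions for illustration, not values from the paper.

```python
import numpy as np

def concat_channels(channel_feats):
    """channel_feats: list of (n_frames, feat_dim) arrays, one per microphone.
    Returns an (n_frames, n_channels * feat_dim) matrix for DNN input."""
    # Truncate to the shortest channel so all frames align.
    n_frames = min(f.shape[0] for f in channel_feats)
    return np.hstack([f[:n_frames] for f in channel_feats])

# Example: 8 distant channels of 40-dim filterbank features, 500 frames each.
feats = [np.random.randn(500, 40) for _ in range(8)]
dnn_input = concat_channels(feats)   # shape (500, 320)
```

Unlike a beamformer, this leaves the combination of channels to the network, which can learn speech-specific cues from the joint representation.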


Zhang J.,Key Laboratory of Speech Acoustics and Content Understanding | Pan F.,Key Laboratory of Speech Acoustics and Content Understanding | Dong B.,Key Laboratory of Speech Acoustics and Content Understanding | Zhao Q.,Key Laboratory of Speech Acoustics and Content Understanding | Yan Y.,Key Laboratory of Speech Acoustics and Content Understanding
IEICE Transactions on Information and Systems | Year: 2013

In this paper, we present a novel method for automatic pronunciation quality assessment. Unlike the popular Goodness of Pronunciation (GOP) method, this method does not map decoding confidence into a pronunciation quality score, but differentiates utterances of different pronunciation quality directly. In this method, the student's utterance is decoded twice. The first decoding pass obtains the time boundaries of each phone in the utterance by forced alignment with a conventionally trained acoustic model (AM). The second pass differentiates the pronunciation quality of each triphone using a specially trained AM, in which triphones of different pronunciation qualities are trained as distinct units; this model is trained discriminatively so that it best discriminates among triphones that share the same name but have different pronunciation quality scores. Because the decoding network in the second pass contains these quality-specific triphone units, phone-level scores can be read directly from the decoding result. The phone-level scores are then combined into sentence-level scores using the maximum entropy criterion. The experimental results show that scoring performance increased significantly compared to the GOP method, especially at the sentence level. Copyright © 2013 The Institute of Electronics, Information and Communication Engineers.
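A minimal sketch of the final combination step, realising the maximum entropy criterion as a multinomial logistic regression (a standard MaxEnt classifier). The summary statistics, toy data, and 0.5 threshold are illustrative assumptions; the paper's actual feature layout is not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_features(phone_scores):
    # Summarise phone-level quality scores: mean, minimum, and the fraction
    # of phones below a (hypothetical) 0.5 quality threshold.
    s = np.asarray(phone_scores, dtype=float)
    return [s.mean(), s.min(), float((s < 0.5).mean())]

# Toy training data: phone-score lists with manual sentence scores (0-2).
train_phone_scores = [[0.9, 0.8, 0.95], [0.4, 0.3, 0.6], [0.7, 0.9, 0.5]]
train_sentence_labels = [2, 0, 1]

X = np.array([sentence_features(s) for s in train_phone_scores])
maxent = LogisticRegression(max_iter=1000)  # multinomial MaxEnt classifier
maxent.fit(X, train_sentence_labels)
print(maxent.predict([sentence_features([0.9, 0.85, 0.4, 0.95])]))
```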
