National Engineering Laboratory of Speech and Language Information Processing

Hefei, China


Song Y.,National Engineering Laboratory of Speech and Language Information Processing | Cui R.,National Engineering Laboratory of Speech and Language Information Processing | Hong X.,National Engineering Laboratory of Speech and Language Information Processing | McLoughlin I.,National Engineering Laboratory of Speech and Language Information Processing | And 2 more authors.
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings | Year: 2015

Effective representation plays an important role in automatic spoken language identification (LID). Recently, several representations that employ a pre-trained deep neural network (DNN) as the front-end feature extractor have achieved state-of-the-art performance. However, performance is still far from satisfactory for dialect and short-duration utterance identification tasks, owing to deficiencies in existing representations. To address this issue, this paper proposes improved representations that exploit information extracted from different layers of the DNN structure. This is conceptually motivated by regarding the DNN as a bridge between low-level acoustic input and high-level phonetic output features. Specifically, we employ a deep bottleneck network (DBN), a DNN with an internal bottleneck layer, as the feature extractor, and extract representations from two layers of this single network, i.e. DBN-TopLayer and DBN-MidLayer. Evaluations on the NIST LRE2009 dataset, as well as the more specific dialect recognition task, show that each representation achieves an incremental performance gain. Furthermore, a simple fusion of the representations is shown to exceed current state-of-the-art performance. © 2015 IEEE.
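To make the two-layer extraction idea concrete, here is a minimal PyTorch sketch. The topology, sigmoid activations, 39-dimensional input, and all layer sizes are illustrative assumptions rather than the paper's configuration; the point is only that frame-level activations from the internal bottleneck layer ("DBN-MidLayer") and the output layer ("DBN-TopLayer") of one network are pooled into two fixed-length utterance representations.

```python
import torch
import torch.nn as nn

class DBN(nn.Module):
    """Toy deep bottleneck network: hidden layers, a narrow internal
    bottleneck, then layers up to phonetic (e.g. senone) targets."""
    def __init__(self, n_in=39, n_hid=1024, n_bn=64, n_out=3000):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Linear(n_in, n_hid), nn.Sigmoid(),
            nn.Linear(n_hid, n_hid), nn.Sigmoid(),
        )
        self.bottleneck = nn.Linear(n_hid, n_bn)   # internal bottleneck layer
        self.post = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bn, n_hid), nn.Sigmoid(),
            nn.Linear(n_hid, n_out),               # phonetic output targets
        )

    def forward(self, x):
        mid = self.bottleneck(self.pre(x))         # "DBN-MidLayer" activations
        top = self.post(mid)                       # "DBN-TopLayer" outputs
        return mid, top

net = DBN().eval()
frames = torch.randn(200, 39)                      # one utterance: 200 feature frames
with torch.no_grad():
    mid, top = net(frames)

# Pool frame-level activations into fixed-length utterance representations.
rep_mid = mid.mean(dim=0)                          # acoustic-oriented representation
rep_top = top.softmax(dim=-1).mean(dim=0)          # phonetic posterior representation
```

In this spirit, rep_mid carries lower-level acoustic information while rep_top summarizes phonetic posteriors; the paper's fusion result corresponds to combining classifiers built on both representations.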


Li X.,Hefei University of Technology | Li X.,National Engineering Laboratory of Speech and Language Information Processing | Yu J.,Hefei University of Technology | Yu J.,National Engineering Laboratory of Speech and Language Information Processing | And 3 more authors.
Shengxue Xuebao/Acta Acustica | Year: 2014

A prosody conversion method was proposed for transforming neutral speech to a required target emotion, in which F0 is modeled by DCT and converted by a GMM-based method at both the phrase and syllable levels, while duration is converted by a CART-based method at the phoneme level. A corpus consisting of three basic emotions was used for training and testing. Objective evaluation and listening test results showed that the method converts emotional prosody effectively; sad emotion conversion achieved nearly 100% accuracy in the listening test.
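As an illustration of the F0 building blocks, the sketch below shows DCT parameterization of a log-F0 contour and minimum mean-square-error regression under a joint GMM. The component count, the number of retained coefficients, and the toy training data are assumptions; the paper additionally applies the conversion at both phrase and syllable levels and converts duration with CART, which is not shown here.

```python
import numpy as np
from scipy.fftpack import dct, idct
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

N_DCT = 5  # DCT coefficients kept per contour (an assumption)

def f0_to_dct(logf0):
    """Parameterize a log-F0 contour by its first N_DCT DCT coefficients."""
    return dct(logf0, norm="ortho")[:N_DCT]

def dct_to_f0(coefs, n_frames):
    """Reconstruct a smooth contour from truncated DCT coefficients."""
    full = np.zeros(n_frames)
    full[:N_DCT] = coefs
    return idct(full, norm="ortho")

def gmm_convert(gmm, x):
    """MMSE regression E[y | x] under a joint GMM over z = [x; y]."""
    d = len(x)
    # Posterior responsibility of each component given the source vector x.
    resp = np.array([
        gmm.weights_[k] * multivariate_normal.pdf(
            x, gmm.means_[k][:d], gmm.covariances_[k][:d, :d])
        for k in range(gmm.n_components)
    ])
    resp /= resp.sum()
    # Responsibility-weighted component-wise conditional means.
    y = np.zeros(d)
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k][:d], gmm.means_[k][d:]
        cov_xx = gmm.covariances_[k][:d, :d]
        cov_yx = gmm.covariances_[k][d:, :d]
        y += resp[k] * (mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x))
    return y

# Toy training: paired neutral/emotional coefficient vectors (invented data).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, N_DCT))
Y = 1.2 * X + 0.3 + 0.05 * rng.standard_normal((500, N_DCT))
gmm = GaussianMixture(n_components=4, covariance_type="full").fit(np.hstack([X, Y]))

converted = gmm_convert(gmm, X[0])            # emotional DCT coefficients
contour = dct_to_f0(converted, n_frames=60)   # back to a frame-level contour
```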


Yu J.,National Engineering Laboratory of Speech and Language Information Processing | Li A.,Chinese Academy of Sciences
2014 IEEE International Conference on Image Processing, ICIP 2014 | Year: 2014

This paper proposes a text-driven 3D visual pronunciation system for Mandarin Chinese. First, an articulatory speech corpus is collected with Electro-Magnetic Articulography, and articulatory and acoustic models are trained jointly using a multi-stream Hidden Semi-Markov Model; second, based on an anatomically accurate 3D articulatory mesh model constructed from Magnetic Resonance Imaging, the Hidden Semi-Markov Model synthesis result is combined with articulatory anatomy to produce realistic articulatory animation synchronized with speech. Experimental results show the effectiveness of the system for instructing language learners in articulation. © 2014 IEEE.
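A toy sketch of the synchronization idea follows: because a Hidden Semi-Markov Model has explicit state durations, an articulatory stream and an acoustic stream generated from the same state sequence stay aligned frame by frame. All state values and durations below are invented for illustration; the actual system generates smooth trajectories from trained multi-stream HSMMs rather than piecewise-constant state means.

```python
import numpy as np

FRAME_SHIFT_MS = 5  # frame period (an assumption)

# Toy per-state parameters: an explicit duration (frames), a mean articulatory
# vector (e.g. tongue-tip x/y from EMA), and a mean acoustic vector.
states = [
    {"dur": 12, "artic": np.array([0.1, -0.3]), "acoust": np.array([1.0])},
    {"dur": 20, "artic": np.array([0.5,  0.2]), "acoust": np.array([0.4])},
    {"dur":  8, "artic": np.array([0.2,  0.0]), "acoust": np.array([0.7])},
]

def synthesize(states):
    artic, acoust = [], []
    for s in states:
        # The same explicit HSMM state duration drives both streams,
        # so articulation and audio remain frame-synchronized.
        artic += [s["artic"]] * s["dur"]
        acoust += [s["acoust"]] * s["dur"]
    return np.stack(artic), np.stack(acoust)

artic_traj, acoust_traj = synthesize(states)
assert len(artic_traj) == len(acoust_traj)   # both streams share one timeline
print(f"utterance duration: {len(artic_traj) * FRAME_SHIFT_MS} ms")
```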


Li X.,Hefei University of Technology | Li X.,National Engineering Laboratory of Speech and Language Information Processing | Wang Z.-F.,Hefei University of Technology | Wang Z.-F.,Chinese Academy of Sciences | Wang Z.-F.,National Engineering Laboratory of Speech and Language Information Processing
Proceedings of the 9th International Symposium on Chinese Spoken Language Processing, ISCSLP 2014 | Year: 2014

In this paper, we present a frame-correlation-based autoregressive GMM method for voice conversion. In our system, the cross-frame correlation of the source features is modeled with augmented delta features, and the cross-frame correlation of the target features is modeled by autoregressive models. The expectation-maximization (EM) algorithm is used for model training, and a maximum-likelihood parameter conversion algorithm is then employed to convert the features of a source speaker into those of a target speaker frame by frame. The method is consistent between training and conversion, using the target features' cross-frame correlation explicitly at both stages. Experimental results show that the proposed method performs well: its test-set log probability is higher than that of the GMM-DYN (GMM with dynamic features) method, its subjective evaluation results are comparable to GMM-DYN, and it is much better suited to low-latency applications. © 2014 IEEE.
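A minimal sketch of the frame-by-frame autoregressive conversion loop, under simplifying assumptions: a single linear component with a fixed AR matrix stands in for a full autoregressive GMM trained by EM, and all parameter values are toy numbers. It illustrates why the scheme suits low latency: each converted frame depends only on the current source frame and the previously converted target frame, with no lookahead.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 24                                   # spectral feature dimension (assumption)
A = 0.5 * np.eye(D)                      # AR coupling to the previous target frame (toy)
W = 0.1 * rng.standard_normal((D, D))    # source-to-target regression matrix (toy)
b = np.zeros(D)                          # bias term

def convert(source_frames):
    """Convert source frames one at a time; no future frames are needed."""
    y_prev = np.zeros(D)                 # initial target state
    out = []
    for x in source_frames:              # strictly frame-by-frame => low latency
        y = A @ y_prev + W @ x + b       # target frame depends on y_{t-1} and x_t
        out.append(y)
        y_prev = y
    return np.stack(out)

converted = convert(rng.standard_normal((100, D)))
```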
