Jia J.,Tsinghua University | Jia J.,Tsinghua National Laboratory for Information Sciences and Technology | Jia J.,Key Laboratory of Pervasive Computing | Leung W.-K.,Tsinghua University | And 14 more authors.
Journal of Computer Science and Technology | Year: 2014

Computer-aided pronunciation training (CAPT) technologies enable the use of automatic speech recognition to detect mispronunciations in second language (L2) learners’ speech. In order to further facilitate learning, we aim to develop a principle-based method for generating a gradation of the severity of mispronunciations. This paper presents an approach towards gradation that is motivated by auditory perception. We have developed a computational method for generating a perceptual distance (PD) between two spoken phonemes. This is used to compute the auditory confusion of the native language (L1). PD is found to correlate well with the mispronunciations detected in a CAPT system for Chinese learners of English, i.e., L1 being Chinese (Mandarin and Cantonese) and L2 being US English. The results show that auditory confusion is indicative of pronunciation confusions in L2 learning. PD can also be used to help us grade the severity of errors (i.e., mispronunciations that confuse more distant phonemes are more severe) and accordingly prioritize the order of corrective feedback generated for the learners. © 2014, Springer Science+Business Media New York.
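The severity-grading idea in this abstract can be sketched as a simple ranking step: detected mispronunciations are ordered by descending perceptual distance, so confusions between more distant phonemes surface first for corrective feedback. The PD values and phoneme pairs below are illustrative placeholders, not the paper's measured distances.

```python
def rank_by_severity(mispronunciations, pd_table):
    """Sort (intended, produced) phoneme pairs by descending perceptual
    distance: larger PD means a more severe confusion."""
    return sorted(
        mispronunciations,
        key=lambda pair: pd_table.get(pair, 0.0),
        reverse=True,
    )

# Hypothetical PD values between confused phoneme pairs (intended, produced).
pd_table = {
    ("th", "s"): 0.31,
    ("ih", "iy"): 0.12,
    ("v", "w"): 0.45,
}
errors = [("ih", "iy"), ("v", "w"), ("th", "s")]
print(rank_by_severity(errors, pd_table))
# the most distant (most severe) confusion comes first
```

A CAPT front end could then present feedback to the learner in this order, starting with the confusion most likely to impair intelligibility.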

Jiang J.,Tsinghua National Laboratory for Information Sciences and Technology | Jiang J.,Tsinghua Joint Research Center for Media science | Jiang J.,Tsinghua University | Jia J.,Tsinghua University | And 10 more authors.
Applied Mathematics and Information Sciences | Year: 2014

Tone recognition is a core function in Chinese speech perception. The tone perception ability of people with sensorineural hearing loss (SNHL) is often weaker than that of people with normal hearing, so automatic tone enhancement would be useful in helping them understand Chinese speech better. In this paper, we focus on a tone enhancing model for Chinese disyllabic words. We first analyze the acoustic features related to tone perception. Using an agglomerative hierarchical clustering method, the first and second syllables of disyllabic words are each clustered into 6 clusters. Discriminative features of these clusters are experimentally determined from a set of candidate features related to tone perception, such as pitch value, pitch range, and position of minimum pitch. We further propose a practicable tone enhancing model based on these discriminative features: 1) an input pitch contour is classified by calculating the distance between it and the centroid of each cluster; 2) the cluster with the smallest distance is selected, and the unclassified pitch contour is assigned to it; 3) the pitch contour is modified for tone enhancement using TD-PSOLA, with the model parameters corresponding to that cluster. Both statistical and subjective experiments show that a higher hit rate of tone recognition can be obtained after tone enhancement with the proposed model. In particular, the proposed enhancing model also avoids explicit tone recognition, making it more convincing and less laborious. © 2014 NSP Natural Sciences Publishing Cor.
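Steps 1) and 2) of the model amount to nearest-centroid classification of a pitch contour. A minimal sketch, assuming contours are summarized by a fixed-length feature vector (here a hypothetical triple of pitch mean, pitch range, and relative position of minimum pitch) and Euclidean distance; the centroid values are invented for illustration:

```python
import math

def classify_contour(contour, centroids):
    """Assign a pitch-contour feature vector to the cluster whose
    centroid is nearest in Euclidean distance (steps 1 and 2)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda name: dist(contour, centroids[name]))

# Illustrative centroids over (pitch mean in Hz, pitch range in Hz,
# relative position of the pitch minimum in [0, 1]).
centroids = {
    "rising":  (220.0, 60.0, 0.1),
    "falling": (200.0, 80.0, 0.9),
    "level":   (210.0, 10.0, 0.5),
}
print(classify_contour((218.0, 55.0, 0.15), centroids))  # → rising
```

Step 3) would then hand the winning cluster's model parameters to TD-PSOLA to reshape the contour.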

Wu B.,Tsinghua University | Wu B.,Key Laboratory of Pervasive Computing | Wu B.,Tsinghua National Laboratory for Information Sciences and Technology | Jia J.,Tsinghua University | And 6 more authors.
Proceedings - IEEE International Conference on Multimedia and Expo | Year: 2016

In this paper, we tackle the problem of inferring users' emotions in real-world Voice Dialogue Applications (VDAs, e.g., Siri, Cortana). We first conduct an investigation, indicating that besides the text information of users' queries, the acoustic information and query attributes are very important in inferring emotions in VDAs. To integrate the information above, we propose a Hybrid Emotion Inference Model (HEIM), which involves a Latent Dirichlet Allocation (LDA) to extract text features and a Long Short-Term Memory (LSTM) network to model the acoustic features. To further improve accuracy, a Recurrent Autoencoder Guided by Query Attributes (RAGQA), which incorporates other emotion-related query attributes, is used in HEIM to pre-train the LSTM. HEIM achieves an accuracy of 75.2% on a data set of 93,000 utterances collected from Sogou Voice Assistant (a Chinese counterpart of Siri), outperforming state-of-the-art methods by 33.5%-38.5%. Specifically, we discover that on average, the acoustic information enhances the performance by 46.6%, while query attributes further enhance the performance by 6.5%. © 2016 IEEE.
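The fusion idea behind HEIM can be sketched in its simplest assumed form: a text-feature vector from a topic model is concatenated with an acoustic embedding before the final emotion classifier. This is an illustrative simplification, not the paper's exact architecture, and all vector values below are hypothetical.

```python
def fuse_features(lda_topics, lstm_embedding):
    """Concatenate text-topic proportions with an acoustic embedding
    to form the joint feature vector fed to the emotion classifier."""
    return list(lda_topics) + list(lstm_embedding)

topics = [0.7, 0.1, 0.2]           # hypothetical LDA topic proportions
acoustic = [0.05, -0.3, 0.8, 0.1]  # hypothetical LSTM hidden state
fused = fuse_features(topics, acoustic)
print(len(fused))  # → 7
```

In the actual model the acoustic branch is additionally pre-trained with the query-attribute-guided autoencoder (RAGQA) before fusion.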

Lin H.,Key Laboratory of Pervasive Computing | Lin H.,Tsinghua National Laboratory for Information Sciences and Technology | Lin H.,Tsinghua University | Jia J.,Key Laboratory of Pervasive Computing | And 6 more authors.
2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2013 | Year: 2013

In this paper, we present a novel interactive, multimodal, real-time 3D talking avatar application for mobile platforms. The application is based on a novel network-independent, stand-alone framework using cross-platform JNI and the OpenGL ES library. In this framework, we implement the audio synthesis, facial animation rendering and the audio-visual synchronization process on the mobile client using the native APIs to optimize rendering performance and power consumption. We also utilize the existing interactive APIs on mobile devices to extend the usability of the application. Experimental results show that the proposed framework can run smoothly on current mobile devices with real-time multimodal interaction. Compared to the traditional video streaming method and the client-server framework, the proposed framework has a much lower network requirement, much shorter interaction delay, and more efficient power consumption. The presented application can be used in entertainment, education and many other interactive areas. © 2013 APSIPA.

Wu B.,Tsinghua University | Wu B.,Key Laboratory of Pervasive Computing | Wu B.,Tsinghua National Laboratory for Information Sciences and Technology | Jia J.,Tsinghua University | And 7 more authors.
Proceedings - IEEE International Conference on Multimedia and Expo | Year: 2015

Understanding the essential emotions behind social images is of vital importance: it can benefit many applications such as image retrieval and personalized recommendation. While previous related research mostly focuses on image visual features, in this paper, we aim to tackle this problem by linking emotion inference with users' demographics. Specifically, we propose a partially-labeled factor graph model named D-FGM to predict the emotions embedded in social images, using not only the image visual features but also users' demographic information. We investigate whether users' demographics like gender, marital status and occupation are related to the emotions of social images, and then leverage the uncovered patterns into the model as different factors. Experiments on a data set from the world's largest image sharing website, Flickr, confirm the accuracy of the proposed model. The effectiveness of the users' demographics factors is also verified by factor contribution analysis, which reveals some interesting behavioral phenomena as well. © 2015 IEEE.

Ren Z.,Tsinghua University | Ren Z.,Key Laboratory of Pervasive Computing | Jia J.,Tsinghua University | Jia J.,Key Laboratory of Pervasive Computing | And 4 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2014

Emotions are increasingly and controversially central to our public life. Compared to text or image data, voice is the most natural and direct way to express one's emotions in real time. With the increasing adoption of smartphone voice dialogue applications (e.g., Siri and Sogou Voice Assistant), large-scale networked voice data can help us better quantitatively understand the emotional world we live in. In this paper, we study the problem of inferring public emotions from large-scale networked voice data. In particular, we first investigate the primary emotions and the underlying emotion patterns in human-mobile voice communication. Then we propose a partially-labeled factor graph model (PFG) to incorporate both acoustic features (e.g., energy, f0, MFCC, LFPC) and correlation features (e.g., individual consistency, time associativity, environment similarity) to automatically infer emotions. We evaluate the proposed model on a real dataset from the Sogou Voice Assistant application. The experimental results verify the effectiveness of the proposed model. © 2014 Springer International Publishing.
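The factor-graph intuition here can be illustrated with a toy scoring function: an emotion label for an utterance is scored by summing log-potentials from an acoustic-feature factor and a correlation factor (e.g., individual consistency, where a user's previous utterance tends to share its emotion). This is a deliberately simplified sketch, not the PFG's actual inference procedure, and every number below is invented.

```python
def score(label, acoustic_logpot, prev_label, consistency_weight=0.5):
    """Sum the acoustic log-potential for a label with a bonus from
    the individual-consistency correlation factor."""
    s = acoustic_logpot[label]
    if prev_label is not None and prev_label == label:
        s += consistency_weight  # individual-consistency factor
    return s

# Hypothetical acoustic log-potentials for one utterance.
acoustic = {"happy": -0.4, "angry": -1.2, "neutral": -0.6}
# The same user's previous utterance was labeled "neutral": the
# correlation factor flips the decision away from the acoustic winner.
best = max(acoustic, key=lambda l: score(l, acoustic, prev_label="neutral"))
print(best)  # → neutral
```

In the real model, such factors are learned jointly over the whole graph with loopy inference rather than scored per utterance in isolation.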
