Pittsburgh, PA, United States

Kumatani K.,Disney Research | McDonough J.,Carnegie Mellon University | Raj B.,Carnegie Mellon University
IEEE Signal Processing Magazine | Year: 2012

Distant speech recognition (DSR) holds the promise of the most natural human-computer interface because it enables man-machine interactions through speech, without the necessity of donning intrusive body- or head-mounted microphones. Recognizing distant speech robustly, however, remains a challenge. This contribution provides a tutorial overview of DSR systems based on microphone arrays. In particular, we present recent work on acoustic beamforming for DSR, along with experimental results verifying the effectiveness of the various algorithms described here; beginning from a word error rate (WER) of 14.3% with a single microphone of a linear array, our state-of-the-art DSR system achieved a WER of 5.3%, which was comparable to that of 4.2% obtained with a lapel microphone. Moreover, we present an emerging technology in the area of far-field audio and speech processing based on spherical microphone arrays. Performance comparisons of spherical and linear arrays reveal that a spherical array with a diameter of 8.4 cm can provide recognition accuracy comparable to or better than that obtained with a large linear array with an aperture length of 126 cm. © 2012 IEEE.
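The word error rate figures quoted in this abstract follow the standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below illustrates that definition only. It is not the authors' scoring tool, and the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or text normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("turn the lights on", "turn lights on"))
```

Under this definition, the reported drop from 14.3% to 5.3% WER corresponds to a relative error reduction of roughly 63%.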


Memisevic R.,Goethe University Frankfurt | Sigal L.,Disney Research | Fleet D.J.,Kings College
IEEE Transactions on Pattern Analysis and Machine Intelligence | Year: 2012

Latent variable models, such as the GPLVM and related methods, help mitigate overfitting when learning from small or moderately sized training sets. Nevertheless, existing methods suffer from several problems: 1) complexity, 2) the lack of explicit mappings to and from the latent space, 3) an inability to cope with multimodality, and 4) the lack of a well-defined density over the latent space. We propose an LVM called the Kernel Information Embedding (KIE) that defines a coherent joint density over the input and a learned latent space. Learning is quadratic, and it works well on small data sets. We also introduce a generalization, the shared KIE (sKIE), that allows us to model multiple input spaces (e.g., image features and poses) using a single, shared latent representation. KIE and sKIE permit missing data during inference and partially labeled data during learning. We show that with data sets too large to learn a coherent global model, one can use the sKIE to learn local online models. We use sKIE for human pose inference. © 2012 IEEE.


Vondrak M.,Brown University | Sigal L.,Brown University | Jenkins O.C.,Disney Research
IEEE Transactions on Pattern Analysis and Machine Intelligence | Year: 2013

We propose a simulation-based dynamical motion prior for tracking human motion from video in the presence of physical ground-person interactions. Most tracking approaches to date have focused on efficient inference algorithms and/or learning of prior kinematic motion models; however, few can explicitly account for the physical plausibility of recovered motion. Here, we aim to recover physically plausible motion of a single articulated human subject. Toward this end, we propose a full-body 3D physical simulation-based prior that explicitly incorporates a model of human dynamics into the Bayesian filtering framework. We consider the motion of the subject to be generated by a feedback “control loop” in which Newtonian physics approximates the rigid-body motion dynamics of the human and the environment through the application and integration of interaction forces, motor forces, and gravity. Interaction forces prevent physically impossible hypotheses, enable more appropriate reactions to the environment (e.g., ground contacts), and are produced from detected human-environment collisions. Motor forces actuate the body, ensure that proposed pose transitions are physically feasible, and are generated using a motion controller. For efficient inference in the resulting high-dimensional state space, we utilize an exemplar-based control strategy that reduces the effective search space of motor forces. As a result, we are able to recover physically plausible motion of human subjects from monocular and multiview video. We show, both quantitatively and qualitatively, that our approach performs favorably with respect to Bayesian filtering methods with standard motion priors. © 1979-2012 IEEE.


Sigal L.,Disney Research | Isard M.,Microsoft | Haussecker H.,Intel Corporation | Black M.J.,Max Planck Institute for Intelligent Systems (Tubingen)
International Journal of Computer Vision | Year: 2012

We formulate the problem of 3D human pose estimation and tracking as one of inference in a graphical model. Unlike traditional kinematic tree representations, our model of the body is a collection of loosely connected body parts. In particular, we model the body using an undirected graphical model in which nodes correspond to parts and edges to kinematic, penetration, and temporal constraints imposed by the joints and the world. These constraints are encoded using pairwise statistical distributions that are learned from motion-capture training data. Human pose and motion estimation is formulated as inference in this graphical model and is solved using Particle Message Passing (PaMPas). PaMPas is a form of non-parametric belief propagation that uses a variation of particle filtering that can be applied over a general graphical model with loops. The loose-limbed model and decentralized graph structure allow us to incorporate information from "bottom-up" visual cues, such as limb and head detectors, into the inference process. These detectors enable automatic initialization and aid recovery from transient tracking failures. We illustrate the method by automatically tracking people in multi-view imagery using a set of calibrated cameras and present quantitative evaluation using the HumanEva dataset. © 2011 The Author(s).


Smolic A.,Disney Research
Pattern Recognition | Year: 2011

This paper gives an end-to-end overview of 3D video and free viewpoint video, which can be regarded as advanced functionalities that expand the capabilities of 2D video. Free viewpoint video can be understood as the functionality to freely navigate within real-world visual scenes, as is known, for instance, from virtual worlds in computer graphics. 3D video shall be understood as the functionality that provides the user with a 3D depth impression of the observed scene, which is also known as stereo video. In that sense, as functionalities, 3D video and free viewpoint video are not mutually exclusive but can very well be combined in a single system. Research in this area combines computer graphics, computer vision and visual communications. It spans the whole media processing chain from capture to display, and the design of systems has to take all parts into account, which is outlined in different sections of this paper giving an end-to-end view and mapping of this broad area. The conclusion is that the necessary technology, including standard media formats for 3D video and free viewpoint video, is available or will be available in the future, and that there is a clear demand from industry and users for such advanced types of visual media. As a consequence, we are witnessing these days how such technology enters our everyday life. © 2010 Elsevier Ltd. All rights reserved.
