Thi T.H., National ICT of Australia | Thi T.H., University of New South Wales | Cheng L., Agency for Science, Technology and Research Singapore | Zhang J., National ICT of Australia | And 3 more authors.
Image and Vision Computing | Year: 2012

Human action recognition is a promising yet non-trivial computer vision field with many potential applications. Recent advances in bag-of-feature approaches have brought significant insight into recognizing human actions in complex contexts. It is, however, common practice in the literature to treat an action as merely an orderless set of local salient features. This representation has been shown to be oversimplified, which inherently prevents traditional approaches from being robustly deployed in real-life scenarios. In this work, we propose and show that, by taking the global configuration of local features into account, we can greatly improve recognition performance. We first introduce a novel feature selection process, called the Sparse Hierarchical Bayes Filter, to select only the most contributive features of each action type based on neighboring structure constraints. We then present the application of structured learning to human action analysis: by representing a human action as a complex set of local features, we can incorporate different spatial and temporal feature constraints into the learning tasks of human action classification and localization. In particular, we tackle the problem of action localization in video using structured learning with two alternatives: one is the Dynamic Conditional Random Field, from a probabilistic perspective; the other is the Structural Support Vector Machine, from a max-margin point of view. We evaluate our modular classification-localization framework on various testbeds, where it proves highly effective and robust compared against bag-of-feature methods. © 2011.
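
The contrast this abstract draws between an orderless bag of features and a representation that keeps global configuration can be made concrete with a small toy sketch. The following is an illustrative example, not the authors' code: it quantizes hypothetical local descriptors into visual words with a k-means codebook (scikit-learn is an assumed tool here), then compares a single orderless histogram with per-cell histograms over a spatio-temporal grid that preserve where features occur.

```python
# Illustrative sketch (not the authors' implementation): contrasts an
# orderless bag-of-features histogram with a variant that keeps the
# global spatio-temporal configuration of local features.
# All descriptors and locations below are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descs = rng.normal(size=(500, 64))   # hypothetical local descriptors
locs = rng.uniform(size=(500, 3))    # normalized (x, y, t) of each feature
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(descs)
words = codebook.predict(descs)

def bag_of_features(words, k=32):
    """Orderless representation: one histogram of visual words."""
    h = np.bincount(words, minlength=k).astype(float)
    return h / max(h.sum(), 1.0)

def structured_bof(words, locs, k=32, grid=(2, 2, 2)):
    """Keep global configuration: one histogram per space-time cell."""
    cells = np.floor(locs * np.array(grid)).clip(max=np.array(grid) - 1).astype(int)
    cell_id = np.ravel_multi_index(cells.T, grid)
    hists = [bag_of_features(words[cell_id == c], k) for c in range(int(np.prod(grid)))]
    return np.concatenate(hists)

print(bag_of_features(words).shape)       # (32,)  -> no layout information
print(structured_bof(words, locs).shape)  # (256,) -> layout kept per cell
```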


Thi T.H., National ICT of Australia | Thi T.H., University of New South Wales | Cheng L., Bioinformatics Institute | Zhang J., National ICT of Australia | And 3 more authors.
Computer Vision and Image Understanding | Year: 2012

In this paper, we propose a framework for human action analysis from video footage. In our perspective, a video action sequence is a dynamic structure of sparse local spatio-temporal patches termed action elements, so action analysis in video is carried out based on both the local characteristics and the global shape of a prescribed action. We first detect a set of action elements, the most compact entities of an action, and then extend the idea of the Implicit Shape Model to space time in order to properly integrate the spatial and temporal properties of these action elements. In particular, we consider two different recipes for constructing action elements: one uses a Sparse Bayesian Feature Classifier to choose action elements from all detected Spatial Temporal Interest Points, and these are termed discriminative action elements; the other detects affine-invariant local features from holistic Motion History Images and picks action elements according to their compactness scores, and these are called generative action elements. Action elements detected either way are then used to construct a voting space based on their local feature representations as well as their global configuration constraints. Our approach is evaluated in the two main contexts of current human action analysis challenges: action retrieval and action classification. Comprehensive experimental results show that our proposed framework marginally outperforms all existing state-of-the-art techniques on a range of different datasets. © 2011 Elsevier Inc. All rights reserved.
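
To make the space-time voting idea more tangible, here is a minimal sketch under assumed simplifications: detected action elements are reduced to quantized (x, y, t) positions, each element casts a vote for the action centre using an offset remembered from training (in the spirit of the Implicit Shape Model extension described above), and the peak of the accumulator localizes the action. All positions and offsets are synthetic.

```python
# Minimal sketch (assumptions, not the authors' implementation) of an
# Implicit-Shape-Model-style vote in space time: each matched action
# element votes for the action centre via a trained offset, and a peak
# in the accumulator localizes the action. Data is synthetic.
import numpy as np

rng = np.random.default_rng(1)
grid = np.zeros((32, 32, 32))          # quantized (x, y, t) voting space
true_center = np.array([20, 12, 16])

# hypothetical matched action elements: position + trained centre offset
positions = true_center + rng.integers(-6, 7, size=(40, 3))
offsets = true_center - positions + rng.integers(-1, 2, size=(40, 3))  # noisy

for pos, off in zip(positions, offsets):
    vote = np.clip(pos + off, 0, 31)   # where this element puts the centre
    grid[tuple(vote)] += 1.0           # could be weighted by match confidence

estimate = np.unravel_index(np.argmax(grid), grid.shape)
print("estimated centre:", estimate, "true centre:", tuple(true_center))
```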


Thi T.H., National ICT of Australia | Thi T.H., University of New South Wales | Zhang J., National ICT of Australia | Zhang J., University of New South Wales | And 3 more authors.
Proceedings - IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2010 | Year: 2010

This paper presents a unified framework for human action classification and localization in video using structured learning of local space-time features. Each human action class is represented by its own compact set of local patches. In our approach, we first use a discriminative hierarchical Bayesian classifier to select those space-time interest points that are constructive for each particular action. These concise local features are then passed to a Support Vector Machine with Principal Component Analysis projection for the classification task. Meanwhile, action localization is performed using Dynamic Conditional Random Fields developed to incorporate the spatial and temporal structure constraints of superpixels extracted around those features. Each superpixel in the video is defined by the shape and motion information of its corresponding feature region. Compelling results from experiments on the KTH [22], Weizmann [1], HOHA [13] and TRECVid [23] datasets demonstrate the efficiency and robustness of our framework for human action recognition and localization in video. © 2010 IEEE.
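
The classification stage described above (a PCA projection followed by an SVM) maps naturally onto standard tooling. Below is a rough sketch assuming scikit-learn and synthetic feature vectors; feature extraction and the Bayesian selection step are treated as already done, so this illustrates only the final classifier.

```python
# Rough sketch of the classification stage described above: selected
# local features are projected with PCA and classified by an SVM.
# Feature vectors here are synthetic stand-ins; the interest-point
# detection and Bayesian selection steps are assumed already done.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_classes, per_class, dim = 4, 50, 128   # e.g. 4 action classes
X = np.concatenate(
    [rng.normal(loc=c, size=(per_class, dim)) for c in range(n_classes)]
)
y = np.repeat(np.arange(n_classes), per_class)

clf = make_pipeline(PCA(n_components=16), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```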
