Time filter

Source Type

Zhang C.,Beijing Institute of Technology | Zhang C.,University of Adelaide | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | Shen T.,Beijing Institute of Technology
International Journal of Computer Vision | Year: 2015

We propose a fast, accurate matching method for estimating dense pixel correspondences across scenes. It is a challenging problem to estimate dense pixel correspondences between images depicting different scenes or instances of the same object category. While most such matching methods rely on hand-crafted features such as SIFT, we learn features from a large amount of unlabeled image patches using unsupervised learning. Pixel-layer features are obtained by encoding over the dictionary, followed by spatial pooling to obtain patch-layer features. The learned features are then seamlessly embedded into a multi-layer matching framework. We experimentally demonstrate that the learned features, together with our matching model, outperform state-of-the-art methods such as the SIFT flow (Liu et al. in IEEE Trans Pattern Anal Mach Intell 33(5):978–994, 2011), coherency sensitive hashing (Korman and Avidan in: Proceedings of the IEEE international conference on computer vision (ICCV), 2011) and the recent deformable spatial pyramid matching (Kim et al. in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013) methods both in terms of accuracy and computation efficiency. Furthermore, we evaluate the performance of a few different dictionary learning and feature encoding methods in the proposed pixel correspondence estimation framework, and analyze the impact of dictionary learning and feature encoding with respect to the final matching performance. © 2015 Springer Science+Business Media New York


Liu L.,University of Adelaide | Wang L.,University of Wollongong | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision
IEEE Transactions on Pattern Analysis and Machine Intelligence | Year: 2016

Compact and discriminative visual codebooks are preferred in many visual recognition tasks. In the literature, a number of works have taken the approach of hierarchically merging visual words of an initial large-sized codebook, but implemented this approach with different merging criteria. In this work, we propose a single probabilistic framework to unify these merging criteria, by identifying two key factors: the function used to model the class-conditional distribution and the method used to estimate the distribution parameters. More importantly, by adopting new distribution functions and/or parameter estimation methods, our framework can readily produce a spectrum of novel merging criteria. Three of them are specifically discussed in this paper. For the first criterion, we adopt the multinomial distribution with the Bayesian method; For the second criterion, we integrate the Gaussian distribution with maximum likelihood parameter estimation. For the third criterion, which shows the best merging performance, we propose a max-margin-based parameter estimation method and apply it with the multinomial distribution. Extensive experimental study is conducted to systematically analyze the performance of the above three criteria and compare them with existing ones. As demonstrated, the best criterion within our framework achieves the overall best merging performance among the compared merging criteria developed in the literature. © 1979-2012 IEEE.


Wang P.,University of Adelaide | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | Van Den Hengel A.,University of Adelaide | Van Den Hengel A.,Australian Center for Robotic Vision
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

Conditional Random Fields (CRFs) are one of the core technologies in computer vision, and have been applied to a wide variety of tasks. Conventional CRFs typically define edges between neighboring image pixels, resulting in a sparse graph over which inference can be performed efficiently. However, these CRFs fail to model more complex priors such as long-range contextual relationships. Fully-connected CRFs have thus been proposed. While there are efficient approximate inference methods for such CRFs, usually they are sensitive to initialization and make strong assumptions. In this work, we develop an efficient, yet general SDP algorithm for inference on fully-connected CRFs. The core of the proposed algorithm is a tailored quasi-Newton method, which solves a specialized SDP dual problem and takes advantage of the low-rank matrix approximation for fast computation. Experiments demonstrate that our method can be applied to fully-connected CRFs that could not previously be solved, such as those arising in pixel-level image co-segmentation. © 2015 IEEE.


Li Y.,University of Adelaide | Li Y.,NICTA | Liu L.,University of Adelaide | Shen C.,University of Adelaide | And 3 more authors.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

Mid-level visual element discovery aims to find clusters of image patches that are both representative and discriminative. In this work, we study this problem from the prospective of pattern mining while relying on the recently popularized Convolutional Neural Networks (CNNs). Specifically, we find that for an image patch, activation extracted from the first fully-connected layer of a CNN have two appealing properties which enable its seamless integration with pattern mining. Patterns are then discovered from a large number of CNN activations of image patches through the well-known association rule mining. When we retrieve and visualize image patches with the same pattern (See Fig. 1), surprisingly, they are not only visually similar but also semantically consistent. We apply our approach to scene and object classification tasks, and demonstrate that our approach outperforms all previous works on mid-level visual element discovery by a sizeable margin with far fewer elements being used. Our approach also outperforms or matches recent works using CNN for these tasks. Source code of the complete system is available online. © 2015 IEEE.


Paisitkriangkrai S.,University of Adelaide | Paisitkriangkrai S.,Australian Center for Robotic Vision | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | And 2 more authors.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

We propose an effective structured learning based approach to the problem of person re-identification which outperforms the current state-of-the-art on most benchmark data sets evaluated. Our framework is built on the basis of multiple low-level hand-crafted and high-level visual features. We then formulate two optimization algorithms, which directly optimize evaluation measures commonly used in person re-identification, also known as the Cumulative Matching Characteristic (CMC) curve. Our new approach is practical to many real-world surveillance applications as the re-identification performance can be concentrated in the range of most practical importance. The combination of these factors leads to a person re-identification system which outperforms most existing algorithms. More importantly, we advance state-of-the-art results on person re-identification by improving the rank-1 recognition rates from 40% to 50% on the iLIDS benchmark, 16% to 18% on the PRID2011 benchmark, 43% to 46% on the VIPeR benchmark, 34% to 53% on the CUHK01 benchmark and 21% to 62% on the CUHK03 benchmark. © 2015 IEEE.


Liu F.,University of Adelaide | Liu F.,Australian Center for Robotic Vision | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | And 2 more authors.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

We consider the problem of depth estimation from a single monocular image in this work. It is a challenging task as no reliable depth cues are available, e.g., stereo correspondences, motions etc. Previous efforts have been focusing on exploiting geometric priors or additional sources of information, with all using hand-crafted features. Recently, there is mounting evidence that features from deep convolutional neural networks (CNN) are setting new records for various vision applications. On the other hand, considering the continuous characteristic of the depth values, depth estimations can be naturally formulated into a continuous conditional random field (CRF) learning problem. Therefore, we in this paper present a deep convolutional neural field model for estimating depths from a single image, aiming to jointly explore the capacity of deep CNN and continuous CRF. Specifically, we propose a deep structured learning scheme which learns the unary and pairwise potentials of continuous CRF in a unified deep CNN framework. The proposed method can be used for depth estimations of general scenes with no geometric priors nor any extra information injected. In our case, the integral of the partition function can be analytically calculated, thus we can exactly solve the log-likelihood optimization. Moreover, solving the MAP problem for predicting depths of a new image is highly efficient as closed-form solutions exist. We experimentally demonstrate that the proposed method outperforms state-of-the-art depth estimation methods on both indoor and outdoor scene datasets. © 2015 IEEE.


Li H.,University of Adelaide | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | Van Den Hengel A.,University of Adelaide | And 2 more authors.
IEEE Transactions on Image Processing | Year: 2015

In this paper, we propose an efficient semidefinite programming (SDP) approach to worst case linear discriminant analysis (WLDA). Compared with the traditional LDA, WLDA considers the dimensionality reduction problem from the worst case viewpoint, which is in general more robust for classification. However, the original problem of WLDA is non-convex and difficult to optimize. In this paper, we reformulate the optimization problem of WLDA into a sequence of semidefinite feasibility problems. To efficiently solve the semidefinite feasibility problems, we design a new scalable optimization method with a quasi-Newton method and eigen-decomposition being the core components. The proposed method is orders of magnitude faster than standard interior-point SDP solvers. Experiments on a variety of classification problems demonstrate that our approach achieves better performance than standard LDA. Our method is also much faster and more scalable than standard interior-point SDP solvers-based WLDA. The computational complexity for an SDP with m constraints and matrices of size d by d is roughly reduced from O(m3+md3+m2d2) to O(d3) (m>d in our case). © 2015 IEEE.


Liu L.,University of Adelaide | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | Van Den Hengel A.,University of Adelaide | Van Den Hengel A.,Australian Center for Robotic Vision
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

A number of recent studies have shown that a Deep Convolutional Neural Network (DCNN) pretrained on a large dataset can be adopted as a universal image descriptor, and that doing so leads to impressive performance at a range of image classification tasks. Most of these studies, if not all, adopt activations of the fully-connected layer of a DCNN as the image or region representation and it is believed that convolutional layer activations are less discriminative. This paper, however, advocates that if used appropriately, convolutional layer activations constitute a powerful image representation. This is achieved by adopting a new technique proposed in this paper called cross-convolutional-layer pooling. More specifically, it extracts subarrays of feature maps of one convolutional layer as local features, and pools the extracted features with the guidance of the feature maps of the successive convolutional layer. Compared with existing methods that apply DCNNs in the similar local feature setting, the proposed method avoids the input image style mismatching issue which is usually encountered when applying fully connected layer activations to describe local regions. Also, the proposed method is easier to implement since it is codebook free and does not have any tuning parameters. By applying our method to four popular visual classification tasks, it is demonstrated that the proposed method can achieve comparable or in some cases significantly better performance than existing fully-connected layer based image representations. © 2015 IEEE.


Shen F.,University of Electronic Science and Technology of China | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | Liu W.,IBM | Shen H.T.,University of Queensland
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

Recently, learning based hashing techniques have attracted broad research interests because they can support efficient storage and retrieval for high-dimensional data such as images, videos, documents, etc. However, a major difficulty of learning to hash lies in handling the discrete constraints imposed on the pursued hash codes, which typically makes hash optimizations very challenging (NP-hard in general). In this work, we propose a new supervised hashing framework, where the learning objective is to generate the optimal binary hash codes for linear classification. By introducing an auxiliary variable, we reformulate the objective such that it can be solved substantially efficiently by employing a regularization algorithm. One of the key steps in this algorithm is to solve a regularization sub-problem associated with the NP-hard binary optimization. We show that the sub-problem admits an analytical solution via cyclic coordinate descent. As such, a high-quality discrete solution can eventually be obtained in an efficient computing manner, therefore enabling to tackle massive datasets. We evaluate the proposed approach, dubbed Supervised Discrete Hashing (SDH), on four large image datasets and demonstrate its superiority to the state-of-the-art hashing methods in large-scale image retrieval. © 2015 IEEE.


Li B.,Northwestern Polytechnical University | Li B.,University of Adelaide | Shen C.,University of Adelaide | Shen C.,Australian Center for Robotic Vision | And 4 more authors.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | Year: 2015

Predicting the depth (or surface normal) of a scene from single monocular color images is a challenging task. This paper tackles this challenging and essentially underdetermined problem by regression on deep convolutional neural network (DCNN) features, combined with a post-processing refining step using conditional random fields (CRF). Our framework works at two levels, super-pixel level and pixel level. First, we design a DCNN model to learn the mapping from multi-scale image patches to depth or surface normal values at the super-pixel level. Second, the estimated super-pixel depth or surface normal is refined to the pixel level by exploiting various potentials on the depth or surface normal map, which includes a data term, a smoothness term among super-pixels and an auto-regression term characterizing the local structure of the estimation map. The inference problem can be efficiently solved because it admits a closed-form solution. Experiments on the Make3D and NYU Depth V2 datasets show competitive results compared with recent state-of-the-art methods. © 2015 IEEE.

Loading Australian Center for Robotic Vision collaborators
Loading Australian Center for Robotic Vision collaborators