Shenzhen Key Laboratory of High Performance Data Mining

Shenzhen, China

Chen J.,Guangdong University of Technology | Chen J.,Sun Yat Sen University | Chen J.,Shenzhen Key Laboratory of High Performance Data Mining | Ma Z.,Sun Yat Sen University | Liu Y.,Sun Yat Sen University
IEEE Transactions on Neural Networks and Learning Systems | Year: 2013

Dimensionality reduction is vital in many fields, and alignment-based methods for nonlinear dimensionality reduction have become popular recently because they can map the high-dimensional data into a low-dimensional subspace with the property of local isometry. However, the relationships between patches in original high-dimensional space cannot be ensured to be fully preserved during the alignment process. In this paper, we propose a novel method for nonlinear dimensionality reduction called local coordinates alignment with global preservation. We first introduce a reasonable definition of topology-preserving landmarks (TPLs), which not only contribute to preserving the global structure of datasets and constructing a collection of overlapping linear patches, but they also ensure that the right landmark is allocated to the new test point. Then, an existing method for dimensionality reduction that has good performance in preserving the global structure is used to derive the low-dimensional coordinates of TPLs. Local coordinates of each patch are derived using tangent space of the manifold at the corresponding landmark, and then these local coordinates are aligned into a global coordinate space with the set of landmarks in low-dimensional space as reference points. The proposed alignment method, called landmarks-based alignment, can produce a closed-form solution without any constraints, while most previous alignment-based methods impose the unit covariance constraint, which will result in the deficiency of global metrics and undesired rescaling of the manifold. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithm. © 2012 IEEE.


Ye Y.,Harbin Institute of Technology | Ye Y.,Applied Technology Internet | Wu Q.,Harbin Institute of Technology | Wu Q.,Applied Technology Internet | And 5 more authors.
Pattern Recognition | Year: 2013

For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups. One group will contain strong informative features and the other weak informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that we can ensure that each subspace contains enough informative features for classification in high dimensional data. Testing on both synthetic data and various real data sets in gene classification, image categorization and face recognition data sets consistently demonstrates the effectiveness of this new method. The performance is shown to better that of state-of-the-art algorithms including SVM, the four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms. © 2012 Elsevier Ltd.
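The stratified selection of feature subspaces described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stratified_subspace`, the use of the mean score as the strong/weak cut-off, and the guarantee of at least one strong feature per subspace are all assumptions made for the example. The abstract states only that features are split into two groups and sampled from each proportionally.

```python
import random


def stratified_subspace(scores, subspace_size, threshold=None, rng=None):
    """Pick feature indices from a strong and a weak group proportionally.

    scores: one informativeness score per feature (how it is computed is
    left open here; any per-feature relevance measure would do).
    """
    n = len(scores)
    if not 1 <= subspace_size <= n:
        raise ValueError("subspace_size must be between 1 and len(scores)")
    rng = rng or random.Random()
    if threshold is None:
        # Assumption: the mean score separates strong from weak features.
        threshold = sum(scores) / n

    strong = [i for i, s in enumerate(scores) if s >= threshold]
    weak = [i for i, s in enumerate(scores) if s < threshold]

    # Proportional allocation between the two groups.
    n_strong = round(subspace_size * len(strong) / n)
    # Assumption: keep at least one strong feature whenever one exists.
    n_strong = min(max(n_strong, 1 if strong else 0), len(strong), subspace_size)
    n_weak = min(subspace_size - n_strong, len(weak))
    # Top up from the strong group if the weak group was too small.
    n_strong = subspace_size - n_weak

    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


if __name__ == "__main__":
    demo_scores = [0.9, 0.8] + [0.1] * 8
    print(stratified_subspace(demo_scores, 5, rng=random.Random(0)))
```

In a random forest, a routine like this would be called once per tree node (or per tree) in place of simple random sampling of features.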


He B.,Chongqing University of Technology | He B.,Shenzhen Key Laboratory of High Performance Data Mining
Proceedings of the 2nd International Conference on Electronic and Mechanical Engineering and Information Technology, EMEIT 2012 | Year: 2012

Traditional association rule mining algorithms suffer from two problems: a large number of candidate itemsets and heavy communication traffic. To address these problems, this paper proposes a fast association rule mining algorithm based on cloud computing, namely the FMAAR algorithm. First, the frequent items are found. Second, the FP-tree is built and the frequent itemsets are mined with the FP-growth algorithm. Finally, the association rules are obtained by cloud computing. The experimental results suggest that the FMAAR algorithm is fast and effective. © the authors.
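The last step, deriving association rules from frequent itemsets, can be illustrated with the standard confidence measure. This is a single-machine sketch under stated assumptions, not the FMAAR algorithm itself: the function name `association_rules` and the input format (a dictionary mapping sorted item tuples to support counts) are hypothetical, and the distribution of this step over a cloud platform is not shown.

```python
from itertools import combinations


def association_rules(support, min_confidence):
    """Derive rules antecedent -> consequent from itemset support counts.

    support: dict mapping sorted tuples of items to their support counts.
    A rule is kept when support(itemset) / support(antecedent) reaches
    min_confidence.
    """
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for antecedent in combinations(itemset, k):
                if antecedent not in support:
                    continue  # antecedent count unknown, so skip the rule
                confidence = count / support[antecedent]
                if confidence >= min_confidence:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, confidence))
    return rules


if __name__ == "__main__":
    demo = {("a",): 4, ("b",): 2, ("a", "b"): 2}
    print(association_rules(demo, 0.8))
```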


He B.,Chongqing University of Technology | He B.,Nanjing University | He B.,Shenzhen Key Laboratory of High Performance Data Mining
Advances in Intelligent Systems and Computing | Year: 2014

Cloud computing is large-scale and highly scalable, and data mining based on cloud computing is an important field. The paper proposes an algorithm for mining frequent itemsets based on MapReduce, namely the MFIM algorithm. MFIM distributes data according to the horizontal projection method. Each node computes local frequent itemsets with an FP-tree and MapReduce; the center node then exchanges data with the other nodes and combines the results; finally, the global frequent itemsets are obtained by MapReduce. Theoretical analysis and experimental results suggest that the MFIM algorithm is fast and effective. © Springer India 2014.
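The local-then-global pattern in this abstract can be sketched in a few lines. The example below is an assumption-laden illustration, not the MFIM algorithm: it replaces FP-tree mining with brute-force enumeration of small itemsets, runs the "map" and "reduce" phases in one process, and merges all local counts rather than only locally frequent ones. The names `local_counts` and `global_frequent` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, max_size=2):
    """Map phase: count every itemset up to max_size in one data partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return counts


def global_frequent(partitions, min_support, max_size=2):
    """Reduce phase: sum the local counts and keep globally frequent itemsets.

    partitions: a list of transaction lists, one per node (a horizontal
    split of the database, so each transaction lives on exactly one node).
    """
    total = Counter()
    for partition in partitions:
        total.update(local_counts(partition, max_size))
    return {itemset: c for itemset, c in total.items() if c >= min_support}


if __name__ == "__main__":
    nodes = [
        [["a", "b"], ["a", "c"]],
        [["a", "b", "c"], ["b"]],
    ]
    print(global_frequent(nodes, min_support=2))
```

Brute-force enumeration grows quickly with `max_size`; a tree-based miner such as FP-growth avoids generating candidates this way.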


He B.,Chongqing University of Technology | He B.,Nanjing University | He B.,Shenzhen Key Laboratory of High Performance Data Mining
Lecture Notes in Electrical Engineering | Year: 2016

The paper proposes an algorithm for mining global frequent itemsets based on cloud computing, namely the MGFICC algorithm. In MGFICC, each node computes local frequent itemsets with the FP-growth algorithm and MapReduce; the center node then exchanges data with the other nodes; finally, the global frequent itemsets are obtained by MapReduce. MGFICC requires less communication traffic because it combines top-down and bottom-up search strategies. Theoretical analysis and experimental results suggest that the MGFICC algorithm is fast. © Springer India 2016.


He B.,Chongqing University of Technology | He B.,Nanjing University | He B.,Shenzhen Key Laboratory of High Performance Data Mining
Communications in Computer and Information Science | Year: 2013

The paper proposes a fast distributed algorithm for mining maximum frequent itemsets based on cloud computing, namely the FDMMFI algorithm. In FDMMFI, each node computes local maximum frequent itemsets by cloud computing; the center node then exchanges data with the other nodes and combines the results; finally, the global maximum frequent itemsets are obtained by cloud computing. Theoretical analysis and experimental results suggest that, under the same minimum support threshold, the communication traffic and runtime of FDMMFI are lower than those of CD and FDM. The lower the minimum support threshold, the better the three performance parameters of FDMMFI. The FDMMFI algorithm is fast and effective. © Springer-Verlag Berlin Heidelberg 2013.
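For readers unfamiliar with the term, a maximum (often called maximal) frequent itemset is a frequent itemset with no frequent proper superset. The sketch below shows only that definition as a filter over an already mined collection. It is not the FDMMFI algorithm, and the function name `maximal_itemsets` is hypothetical.

```python
def maximal_itemsets(frequent):
    """Keep only the itemsets that have no proper superset in the collection.

    frequent: an iterable of itemsets (tuples, lists, or sets of items),
    all assumed to be frequent already.
    """
    sets = [frozenset(s) for s in frequent]
    # "s < other" is the proper-subset test on frozensets.
    return [s for s in sets if not any(s < other for other in sets)]


if __name__ == "__main__":
    mined = [("a",), ("b",), ("c",), ("a", "b")]
    print(maximal_itemsets(mined))
```

Because every subset of a frequent itemset is also frequent, the maximal itemsets summarize the whole collection, which is one reason exchanging them between nodes can reduce traffic.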


Xiong T.,CAS Shenzhen Institutes of Advanced Technology | Xiong T.,Université de Sherbrooke | Wang S.,Université de Sherbrooke | Jiang Q.,CAS Shenzhen Institutes of Advanced Technology | And 2 more authors.
IEEE Transactions on Knowledge and Data Engineering | Year: 2014

Clustering categorical sequences is an important and difficult data mining task. Despite recent efforts, the challenge remains, due to the lack of an inherently meaningful measure of pairwise similarity. In this paper, we propose a novel variable-order Markov framework, named weighted conditional probability distribution (WCPD), to model clusters of categorical sequences. We propose an efficient and effective approach to solve the challenging problem of model initialization. To initialize the WCPD model, we propose to use a first-order Markov model built on a weighted fuzzy indicator vector representation of categorical sequences, which we call the WFI Markov model. Based on a cascade optimization framework that combines the WCPD and WFI models, we design a new divisive hierarchical clustering algorithm for clustering categorical sequences. Experimental results on data sets from three different domains demonstrate the promising performance of our models and clustering algorithm. © 1989-2012 IEEE.


Sun H.,Jinan University | Cai S.,Shenzhen University | Cai S.,Shenzhen Key Laboratory of High Performance Data Mining | Weng J.,Jinan University | Massawe R.H.J.,Jinan University
Advanced Materials Research | Year: 2013

In this paper, a new component assembly library for mechanical devices is presented. Component modeling is the key technology, and component library is also an important functional module, which is a tool for instruction design in the mechanical devices. Mechanical properties are mentioned. Interactions between the component design process and the component library are realized. This paper covers the details about the implementation of the component library. The process of traditional mechanical design is simplified with the component library. Consequently, the design and development time for mechanical device instructions can be significantly shortened. © (2013) Trans Tech Publications, Switzerland.


Cheng J.,CAS Shenzhen Institutes of Advanced Technology | Cheng J.,Shenzhen Key Laboratory of High Performance Data Mining | Zeng X.,CAS Shenzhen Institutes of Advanced Technology | Yu J.X.,Chinese University of Hong Kong
Proceedings - International Conference on Data Engineering | Year: 2013

There exist many graph-based applications including bioinformatics, social science, link analysis, citation analysis, and collaborative work. All need to deal with a large data graph. Given a large data graph, in this paper, we study finding top-k answers for a graph pattern query (kGPM), and in particular, we focus on top-k cyclic graph queries where a graph query is cyclic and can be complex. The capability of supporting kGPM provides much more flexibility for a user to search graphs. And the problem itself is challenging. In this paper, we propose a new framework of processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query. We observe a multidimensional representation for using multiple ranked lists to answer a given kGPM query. Under this representation, we propose a cost model to estimate the least number of tree answers to be consumed in each ranked list for a given kGPM query. This leads to a query optimization approach for kGPM processing, and a top-k algorithm to process kGPM with the optimal query plan. We conducted extensive performance studies using a synthetic dataset and a real dataset, and we confirm the efficiency of our proposed approach. © 2013 IEEE.


Guo Q.,CAS Shenzhen Institutes of Advanced Technology | Guo Q.,South China University of Technology | Luo J.,CAS Shenzhen Institutes of Advanced Technology | Luo J.,Shenzhen Key Laboratory of High Performance Data Mining | And 3 more authors.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2013

With the rapid development of location-sensing technologies such as GPS, a huge amount of GPS location data is produced every day. The flood of taxi GPS data makes it possible to predict a plenitude of traffic events on the road network. In this paper, we propose a data-driven approach for traffic state convergence prediction on the road network. We introduce a new method for predicting the future location of taxis on the road network. Furthermore, we propose a statistical model to predict real-time convergence on the road network. We experimentally demonstrate that our approach achieves high prediction precision on massive real-world taxi GPS data. © 2013 Springer-Verlag.
