Feng C.,CAS Institute of Computing Technology |
Zou Y.,Tencent Corporation |
Xu Z.,CAS Institute of Computing Technology
Proceedings - 7th International Conference on Semantics, Knowledge, and Grids, SKG 2011 | Year: 2011
Multi-dimensional range queries are fundamental requirements in large scale Internet applications using Distributed Ordered Tables. Apache Cassandra is a Distributed Ordered Table when it employs order-preserving hashing as data partitioner. Cassandra supports multi-dimensional range queries with poor performance and with a limitation that there must be one dimension with an equal operator. Based on the success of CCIndex scheme in Apache HBase, this paper tries to answer the question: Can CCIndex benefit multi-dimensional range queries in DOTs like Cassandra? This paper studies the feasibility of employing CCIndex in Cassandra, proposes a new approach to estimate result size, implements CCIndex in Cassandra including recovery mechanisms and studies the pros and cons of CCIndex for different DOTs. Experimental results show that CCIndex gains 2.4 to 3.7 times efficiency over Cassandra's index scheme with 1% to 50% selectivity for 2 million records. This paper shows that CCIndex is a general approach for DOTs, and could gain better performance for DOTs which perform scan tasks much faster than random read. This paper reveals that Cassandra is optimized for hash tables rather than ordered tables in performing read and range queries. © 2011 IEEE.
Wang S.,Beijing Institute of Technology |
Chen Y.,Tencent Corporation
International Journal of Data Warehousing and Mining | Year: 2014
In this paper, a novel clustering algorithm, HASTA (HierArchical-grid cluStering based on daTA field), is proposed to model the dataset as a data field by assigning all the data objects into qusantized grids. Clustering centers of HASTA are defined to locate where the maximum value of local potential is. Edges of cluster in HASTA are identified by analyzing the first-order partial derivative of potential value, thus the full size of arbitrary shaped clusters can be detected. The experimented case demonstrates that HASTA performs effectively upon different datasets and can find out clusters of arbitrary shapes in noisy circumstance. Besides those, HASTA does not force users to preset the exact amount of clusters inside dataset. Furthermore, HASTA is insensitive to the order of data input. The time complexity of HASTA achieves O(n). Those advantages will potentially benefit the mining of big data. Copyright © 2014, IGI Global.
Li Y.,University of Science and Technology Beijing |
Hu C.,University of Science and Technology Beijing |
Chen M.,Tencent Corporation |
Hu J.,University of Technology of Troyes
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2012
In this paper we investigate aesthetic features in learning aesthetic judgments in an evolutionary art system. We evolve genetic art with our evolutionary art system, BioEAS, by using genetic programming and an aesthetic learning model. The model is built by learning both phenotype and genotype features, which we extracted from internal evolutionary images and external real world paintings, which could lead to more interesting paths. By learning aesthetic judgment and applying the knowledge to evolve aesthetical images, the model helps user to automate the process of evolutionary process. Several independent experimental results show that our system is efficient to reduce user fatigue in evolving art. © 2012 Springer-Verlag.
Liu C.,Tencent Corporation |
ACM International Conference Proceeding Series | Year: 2012
Semantic analysis tries to solve problems arising from polysemy and synonymy that are abundant in natural languages. Recently, Gabrilovich and Markovitch propose the Explicit Semantic Analysis (ESA) technique, which complements the well-known Latent Semantic Analysis (LSA) technique. In this paper, we show that the two techniques are not as distinct as their names suggest; instead, we find that ESA is equivalent to a LSA variant, and this equivalence generalizes to all kernel methods using kernels arising from the canonical dot product. Effectively, this result guarantees that ESA would not outperform the peak efficacy of LSA for any applications using the above kernel methods. In short, this paper for the first time establishes the connections between ESA and LSA, quantifies their relative efficacy, and generalizes the result to a big category of kernel methods. © 2012 ACM.
Wang G.,Tencent Corporation |
Wang F.,IBM |
Chen T.,Hong Kong University of Science and Technology |
Yeung D.-Y.,Hong Kong University of Science and Technology |
Lochovsky F.H.,Hong Kong University of Science and Technology
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics | Year: 2012
Traditional learning algorithms use only labeled data for training. However, labeled examples are often difficult or time consuming to obtain since they require substantial human labeling efforts. On the other hand, unlabeled data are often relatively easy to collect. Semisupervised learning addresses this problem by using large quantities of unlabeled data with labeled data to build better learning algorithms. In this paper, we use the manifold regularization approach to formulate the semisupervised learning problem where a regularization framework which balances a tradeoff between loss and penalty is established. We investigate different implementations of the loss function and identify the methods which have the least computational expense. The regularization hyperparameter, which determines the balance between loss and penalty, is crucial to model selection. Accordingly, we derive an algorithm that can fit the entire path of solutions for every value of the hyperparameter. Its computational complexity after preprocessing is quadratic only in the number of labeled examples rather than the total number of labeled and unlabeled examples. © 2006 IEEE.