Lyu T.,Peking University |
Bing L.,Tencent Inc. |
Zhang Z.,Peking University |
Zhang Y.,Peking University
Proceedings - IEEE International Conference on Data Mining, ICDM | Year: 2017
Community detection is a hot topic for researchers in fields including graph theory, social networks, and biological networks. Generally speaking, a community refers to a group of densely linked nodes in the network. Nodes usually have more than one community label, indicating their multiple roles or functions in the network. Unfortunately, existing solutions for overlapping community detection are not capable of scaling to large-scale networks with millions of nodes and edges. In this paper, we propose a fast overlapping-community-detection algorithm, FOX. In an experiment on a network with 3.9 million nodes and 20 million edges, the detection finishes in 14 minutes and provides the highest-quality results; the second fastest algorithm takes ten times longer to run. On another network with 22 million nodes and 127 million edges, our algorithm is the only one that can provide an overlapping community detection result, and it takes only 238 minutes. Our algorithm draws lessons from potential games, a concept in game theory. We measure the closeness of a node to a community by counting the number of triangles formed by the node and two other nodes from the community. Potential games ensure that the algorithm reaches convergence. We also extend the exploitation of triangles to open triangles, which enlarges the scale of the detected communities. © 2016 IEEE.
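The triangle-based closeness measure described in the abstract can be sketched as follows; the function name, adjacency-set representation, and toy graph are ours, not from the paper:

```python
from itertools import combinations

def triangle_closeness(node, community, adj):
    """Count triangles formed by `node` with two community members.

    A triangle exists when `node` links to both members and the two
    members also link to each other. Higher counts suggest the node
    sits densely inside the community.
    """
    neighbors_in_c = adj[node] & community
    return sum(1 for u, v in combinations(neighbors_in_c, 2) if v in adj[u])

# Toy graph: nodes 0-3 form a clique, node 4 hangs off node 0.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
}
community = {1, 2, 3}
print(triangle_closeness(0, community, adj))  # 3 triangles: (1,2), (1,3), (2,3)
print(triangle_closeness(4, community, adj))  # 0
```

The open-triangle extension mentioned in the abstract would presumably relax the requirement that the two community members themselves be linked.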
Huang Y.,Peking University |
Cui B.,Peking University |
Zhang W.,Tencent Inc. |
Jiang J.,Tencent Inc. |
Xu Y.,Peking University
Proceedings of the ACM SIGMOD International Conference on Management of Data | Year: 2015
With the arrival of the big data era, opportunities as well as challenges arise in both industry and academia. As an important service in most web applications, accurate real-time recommendation in the context of big data is in high demand. Traditional recommender systems that analyze data and update models at regular time intervals cannot satisfy the requirements of modern web applications, calling for real-time recommender systems. In this paper, we tackle the "big", "real-time" and "accurate" challenges in real-time recommendation, and propose a general real-time stream recommender system built on Storm, named TencentRec, from three aspects, i.e., "system", "algorithm", and "data". We analyze the large volume of data streams from a wide range of applications by leveraging the considerable computation ability of Storm, together with a data access component and a data storage component developed by us. To deal with various application-specific demands, we have implemented several classic practical recommendation algorithms in TencentRec, including item-based collaborative filtering, content-based, and demographic-based algorithms. In particular, we present a practical scalable item-based CF algorithm in detail, with characteristics such as robustness to the implicit feedback problem, incremental update, and real-time pruning. With the enhancement of real-time data collection and processing, we can capture recommendation changes in real time. We deploy TencentRec in a series of production applications, and observe the superiority of TencentRec in providing accurate real-time recommendations for 10 billion user requests every day. Copyright © 2015 ACM.
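A minimal sketch of the kind of incremental item-based CF update the abstract describes; the class name and the cosine-style similarity are our assumptions, and the paper's actual algorithm additionally handles implicit feedback and real-time pruning:

```python
from collections import defaultdict

class IncrementalItemCF:
    """Co-occurrence counts maintained per streamed (user, item) action,
    so item-item similarities stay fresh without batch recomputation."""

    def __init__(self):
        self.item_count = defaultdict(int)   # N(i): users who touched item i
        self.pair_count = defaultdict(int)   # N(i, j): users who touched both
        self.user_items = defaultdict(set)   # items already seen per user

    def update(self, user, item):
        if item in self.user_items[user]:
            return                           # ignore repeat actions
        self.item_count[item] += 1
        for other in self.user_items[user]:
            self.pair_count[min(item, other), max(item, other)] += 1
        self.user_items[user].add(item)

    def similarity(self, i, j):
        denom = (self.item_count[i] * self.item_count[j]) ** 0.5
        return self.pair_count[min(i, j), max(i, j)] / denom if denom else 0.0

cf = IncrementalItemCF()
for user, item in [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a")]:
    cf.update(user, item)
print(round(cf.similarity("a", "b"), 3))  # 0.816, i.e. 2 / sqrt(3 * 2)
```

Because each action only touches the counters for one user's item set, the update cost is independent of the total number of users, which is what makes the streaming setting feasible.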
Zhong E.,Hong Kong University of Science and Technology |
Fan W.,IBM |
Wang J.,Tencent Inc. |
Xiao L.,Tencent Inc. |
Li Y.,Tencent Inc.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | Year: 2012
Accurate prediction of user behaviors is important for many social media applications, including social marketing, personalization, and recommendation. A major challenge lies in that the available behavior data or interactions between users and items in a given social network are usually very limited and sparse (e.g., >= 99.9% empty). Many previous works model user behavior from only historical user logs. We observe that many people are members of several social networks at the same time, such as Facebook, Twitter and Tencent's QQ. Importantly, their behaviors and interests in different networks influence one another. This gives us an opportunity to leverage the knowledge of user behaviors in different networks, in order to alleviate the data sparsity problem and enhance the predictive performance of user modeling. Combining different networks "simply and naively" does not work well. Instead, we formulate the problem of modeling multiple networks as "composite network knowledge transfer". We first select the most suitable networks inside a composite social network via a hierarchical Bayesian model, parameterized for individual users, and then build topic models for user behavior prediction using both the relationships in the selected networks and related behavior data. To handle big data, we have implemented the algorithm using Map/Reduce. We demonstrate that the proposed composite network-based user behavior model significantly improves predictive accuracy over a number of existing approaches on several real-world applications, such as a very large social-networking dataset from Tencent Inc. © 2012 ACM.
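The composite-network idea can be illustrated, very loosely, as a per-user weighted blend of behavior signals from each member network. In the paper the weights come from a hierarchical Bayesian model and the predictors are topic models; everything here (the network names, the plain score dictionaries, the fixed weights) is a stand-in:

```python
def composite_score(user, item, network_scores, weights):
    """Blend per-network affinity scores using per-user network weights
    (plain numbers here; the paper learns them hierarchically)."""
    return sum(weights.get((user, name), 0.0) * scores.get((user, item), 0.0)
               for name, scores in network_scores.items())

# Hypothetical affinity scores for the same user in two networks:
network_scores = {
    "network_A": {("alice", "game"): 0.9, ("alice", "news"): 0.1},
    "network_B": {("alice", "game"): 0.2, ("alice", "news"): 0.8},
}
# Per-user weights: alice's behavior in network_A is deemed more transferable.
weights = {("alice", "network_A"): 0.7, ("alice", "network_B"): 0.3}
print(round(composite_score("alice", "game", network_scores, weights), 2))  # 0.69
```

The point of learning the weights per user is that a network that is informative for one user (dense, on-topic activity) may be pure noise for another, which is why a "simple and naive" uniform combination underperforms.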
Wang G.,Tencent Inc. |
Wang F.,IBM |
Chen T.,Hong Kong University of Science and Technology |
Yeung D.-Y.,Hong Kong University of Science and Technology |
Lochovsky F.H.,Hong Kong University of Science and Technology
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics | Year: 2012
Traditional learning algorithms use only labeled data for training. However, labeled examples are often difficult or time-consuming to obtain, since they require substantial human labeling effort. On the other hand, unlabeled data are often relatively easy to collect. Semisupervised learning addresses this problem by using large quantities of unlabeled data together with labeled data to build better learning algorithms. In this paper, we use the manifold regularization approach to formulate the semisupervised learning problem, establishing a regularization framework which balances a tradeoff between loss and penalty. We investigate different implementations of the loss function and identify the methods with the least computational expense. The regularization hyperparameter, which determines the balance between loss and penalty, is crucial to model selection. Accordingly, we derive an algorithm that can fit the entire path of solutions for every value of the hyperparameter. Its computational complexity after preprocessing is quadratic only in the number of labeled examples rather than the total number of labeled and unlabeled examples. © 2006 IEEE.
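The regularization framework trades a loss term on the labeled examples against a smoothness penalty over all examples, labeled and unlabeled. A sketch of that objective, assuming squared loss and a graph-Laplacian penalty (the paper investigates several loss implementations; the names here are ours):

```python
import numpy as np

def manifold_objective(f, y_labeled, L, lam):
    """Objective of the semisupervised regularization framework:
    squared loss on the labeled points plus `lam` times a
    graph-Laplacian smoothness penalty over all points."""
    n_labeled = len(y_labeled)
    loss = np.sum((f[:n_labeled] - y_labeled) ** 2)
    penalty = f @ L @ f          # = sum over edges of w_ij * (f_i - f_j)^2
    return loss + lam * penalty

# 3-node chain graph: one labeled node, two unlabeled.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W   # graph Laplacian
f = np.array([1.0, 0.5, 0.0])    # a candidate fit, smooth along the chain
y = np.array([1.0])              # label for the first node only
print(round(manifold_objective(f, y, L, lam=0.1), 4))  # 0.05
```

The hyperparameter `lam` is exactly the quantity whose entire solution path the paper's algorithm traces, avoiding a separate refit per candidate value.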
Wang S.,Beijing Institute of Technology |
Chen Y.,Tencent Inc.
International Journal of Data Warehousing and Mining | Year: 2014
In this paper, a novel clustering algorithm, HASTA (HierArchical-grid cluStering based on daTA field), is proposed to model the dataset as a data field by assigning all the data objects to quantized grids. Cluster centers in HASTA are defined to lie where the local potential reaches its maximum value. Cluster edges in HASTA are identified by analyzing the first-order partial derivative of the potential value, so that the full extent of arbitrarily shaped clusters can be detected. The experiments demonstrate that HASTA performs effectively on different datasets and can find clusters of arbitrary shapes in noisy circumstances. In addition, HASTA does not force users to preset the exact number of clusters in the dataset. Furthermore, HASTA is insensitive to the order of data input. The time complexity of HASTA is O(n). These advantages will potentially benefit the mining of big data. Copyright © 2014, IGI Global.
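The data-field idea can be sketched in one dimension: each grid cell's potential sums distance-decayed contributions from the object counts of all cells, and cluster centers sit where the potential peaks. The Gaussian decay, the `sigma` parameter, and the 1-D grid are our simplifications, not HASTA's exact formulation:

```python
import math

def grid_potential(grid_counts, sigma=1.0):
    """Potential of every cell in a 1-D data field: each cell receives
    distance-decayed contributions from the object counts of all cells."""
    return {c: sum(n * math.exp(-((c - d) / sigma) ** 2)
                   for d, n in grid_counts.items())
            for c in grid_counts}

# Object counts per grid cell: a dense region around cell 1, a smaller one at 6.
counts = {0: 1, 1: 5, 2: 1, 6: 4}
pot = grid_potential(counts)
center = max(pot, key=pot.get)
print(center)  # 1: the densest region's cell has the highest potential
```

Because the potential is computed on quantized grid cells rather than on raw points, the cost scales with the (bounded) number of occupied cells, which is how a grid-based scheme can reach O(n) overall.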
Liu X.,Nanyang Technological University |
Jiang L.,Tencent Inc. |
Wong T.-T.,Chinese University of Hong Kong |
Fu C.-W.,Nanyang Technological University
IEEE Transactions on Visualization and Computer Graphics | Year: 2012
Estimating illumination and deformation fields on textures is essential for both analysis and application purposes. Traditional methods for such estimation usually require complicated and sometimes labor-intensive processing. In this paper, we propose a new perspective on this problem and suggest a novel statistical approach which is much simpler and more efficient. Our experiments show that many textures in daily life are statistically invariant in terms of colors and gradients. Variations of such statistics can be assumed to be influenced by illumination and deformation. This implies that we can inversely estimate the spatially varying illumination and deformation according to the variation of the texture statistics. This enables us to decompose a texture photo into an illumination field, a deformation field, and an implicit texture which is illumination- and deformation-free, within a short period of time and with minimal user input. By processing and recombining these components, a variety of synthesis effects, such as exemplar preparation, texture replacement, surface relighting, and geometry modification, can be achieved. Finally, convincing results are shown to demonstrate the effectiveness of the proposed method. © 1995-2012 IEEE.
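The statistical-invariance assumption can be illustrated with a crude sketch: if the underlying texture is stationary, then local brightness statistics relative to the global mean approximate the illumination field. The windowed-mean estimator below is our stand-in for the paper's model, not its actual method:

```python
import numpy as np

def illumination_field(gray, win=3):
    """Estimate a smooth illumination field as local mean brightness
    relative to the global mean, assuming the underlying texture is
    statistically stationary."""
    pad = win // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    local = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            local[i, j] = padded[i:i + win, j:j + win].mean()
    return local / gray.mean()

# A flat texture under a left-to-right brightness ramp:
photo = np.ones((4, 6)) * np.linspace(0.5, 1.5, 6)
field = illumination_field(photo)
print(bool(field[:, 0].mean() < field[:, -1].mean()))  # True: left is darker
```

Dividing the photo by such a field would undo the ramp, which is the decomposition step the abstract describes (with gradients playing the analogous role for the deformation field).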
Niu L.,University of Chinese Academy of Sciences |
Wu J.,Tencent Inc.
Proceedings - 12th IEEE International Conference on Data Mining Workshops, ICDMW 2012 | Year: 2012
Based on L-2 Support Vector Machines (SVMs), Vapnik and Vashist introduced the concept of Learning Using Privileged Information (LUPI). This new paradigm takes into account the elements of human teaching during the process of machine learning. However, with the utilization of privileged information, the extended L-2 SVM model given by Vapnik and Vashist doubles the number of parameters used in the standard L-2 SVM, so considerable computing time is spent on tuning parameters. To reduce this workload, we proposed using L-1 SVM instead of L-2 SVM for LUPI in our previous work. Unlike LUPI with L-2 SVM, which is formulated as a quadratic program, LUPI with L-1 SVM is essentially a linear program and is computationally much cheaper. On this basis, we discuss in this paper how to employ the wisdom from teachers better and more flexibly through LUPI with L-1 SVM. By introducing kernels, an extended L-1 SVM model, which is still a linear program, is proposed. With the help of nonlinear kernels, the new model allows the privileged information to be explored in a transformed feature space instead of the original data domain. Numerical experiments are carried out on both time series prediction and digit recognition problems. The experimental results validate the effectiveness of our new method. © 2012 IEEE.
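The LP character of the L-1 SVM can be sketched directly (here with a plain linear kernel and without the privileged-information terms, which the paper adds on top; the formulation below is a standard LP-SVM, and the variable names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, C=1.0):
    """L-1 SVM trained as a linear program:
        min  sum(alpha) + C * sum(xi)
        s.t. y_i * (sum_j alpha_j y_j K(x_i, x_j) + b) >= 1 - xi_i
             alpha >= 0, xi >= 0, b free (split as b_plus - b_minus).
    """
    n = len(y)
    K = X @ X.T                                   # linear kernel
    M = (y[:, None] * y[None, :]) * K             # y_i * y_j * K_ij
    # variable vector: [alpha_1..alpha_n, b_plus, b_minus, xi_1..xi_n]
    c = np.concatenate([np.ones(n), [0.0, 0.0], C * np.ones(n)])
    A_ub = np.hstack([-M, -y[:, None], y[:, None], -np.eye(n)])
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(0, None)] * (2 * n + 2))
    alpha, b = res.x[:n], res.x[n] - res.x[n + 1]
    return alpha, b

X = np.array([[2.0], [1.5], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = l1_svm(X, y)
pred = np.sign((alpha * y) @ (X @ X.T) + b)
print(pred)  # recovers the labels on this separable toy set
```

Swapping `K` for a nonlinear kernel matrix leaves the problem a linear program, which is the property the paper exploits when moving privileged information into a transformed feature space.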
Li Y.,University of Science and Technology Beijing |
Hu C.,University of Science and Technology Beijing |
Chen M.,Tencent Company |
Hu J.,University of Technology of Troyes
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | Year: 2012
In this paper we investigate aesthetic features for learning aesthetic judgments in an evolutionary art system. We evolve genetic art with our evolutionary art system, BioEAS, using genetic programming and an aesthetic learning model. The model is built by learning both phenotype and genotype features, which we extract from internal evolutionary images and external real-world paintings, and which can lead to more interesting evolutionary paths. By learning aesthetic judgment and applying that knowledge to evolve aesthetically pleasing images, the model helps users automate the evolutionary process. Several independent experiments show that our system effectively reduces user fatigue in evolving art. © 2012 Springer-Verlag.
Wang P.,CAS Institute of Computing Technology |
Meng D.,CAS Institute of Computing Technology |
Han J.,CAS Institute of Computing Technology |
Zhan J.,CAS Institute of Computing Technology |
And 3 more authors.
IEEE Micro | Year: 2010
Cloud computing drives the design and development of diverse programming models for massive data processing. The Transformer programming framework aims to facilitate the building of diverse data-parallel programming models. Transformer has two layers: a common runtime system and a model-specific system. Using Transformer, the authors show how to implement three programming models: Dryad-like data flow, MapReduce, and All-Pairs. © 2006 IEEE.
Feng C.,CAS Institute of Computing Technology |
Zou Y.,Tencent Corporation |
Xu Z.,CAS Institute of Computing Technology
Proceedings - 7th International Conference on Semantics, Knowledge, and Grids, SKG 2011 | Year: 2011
Multi-dimensional range queries are a fundamental requirement in large-scale Internet applications using Distributed Ordered Tables (DOTs). Apache Cassandra is a Distributed Ordered Table when it employs order-preserving hashing as its data partitioner. Cassandra supports multi-dimensional range queries, but with poor performance and with the limitation that there must be one dimension with an equality operator. Based on the success of the CCIndex scheme in Apache HBase, this paper tries to answer the question: can CCIndex benefit multi-dimensional range queries in DOTs like Cassandra? This paper studies the feasibility of employing CCIndex in Cassandra, proposes a new approach to estimate result size, implements CCIndex in Cassandra, including recovery mechanisms, and studies the pros and cons of CCIndex for different DOTs. Experimental results show that CCIndex achieves 2.4 to 3.7 times the efficiency of Cassandra's index scheme at 1% to 50% selectivity on 2 million records. This paper shows that CCIndex is a general approach for DOTs, and can deliver better performance for DOTs that perform scan tasks much faster than random reads. It also reveals that Cassandra is optimized for hash tables rather than ordered tables when performing reads and range queries. © 2011 IEEE.
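The basic trade secondary indexing makes on an ordered table — keeping a copy of the data sorted by the queried column so a range query becomes a contiguous scan rather than a full-table filter — can be sketched as follows. All names here are ours, and this is a simplification: as we understand CCIndex, it clusters complete records into its complemental index tables, whereas this sketch stores only row ids and pays a primary-table lookup per hit.

```python
import bisect

class IndexedTable:
    """Toy ordered table with a secondary index on one column."""

    def __init__(self):
        self.rows = {}           # primary table: row_id -> record
        self.index = []          # sorted (indexed_value, row_id) pairs

    def put(self, row_id, record, index_col):
        self.rows[row_id] = record
        bisect.insort(self.index, (record[index_col], row_id))

    def range_query(self, low, high):
        # Contiguous scan over the sorted index, not a filter over all rows.
        lo = bisect.bisect_left(self.index, (low, ""))
        hi = bisect.bisect_right(self.index, (high, "\uffff"))
        return [self.rows[row_id] for _, row_id in self.index[lo:hi]]

t = IndexedTable()
t.put("r1", {"age": 25, "name": "ann"}, "age")
t.put("r2", {"age": 30, "name": "bob"}, "age")
t.put("r3", {"age": 41, "name": "cai"}, "age")
print([r["name"] for r in t.range_query(24, 35)])  # ['ann', 'bob']
```

The per-hit random read in `range_query` is exactly the cost a CCIndex-style full-record replica avoids, which is why the scheme pays off most on stores where sequential scans are much faster than random reads.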