Kourtellis N.,Telefonica |
De Francisci Morales G.,QCRI |
Bifet A.,Telecom ParisTech |
Murdopo A.,LARC SMU
Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 | Year: 2016
IoT big data requires new machine learning methods that can scale to large volumes of data arriving at high speed. Decision trees are popular machine learning models since they are very effective, yet easy to interpret and visualize. In the literature, we can find distributed algorithms for learning decision trees, and also streaming algorithms, but not algorithms that combine both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining big data streams, and is thus able to run on real-world clusters. Our experiments to study the accuracy and throughput of VHT demonstrate its ability to scale while attaining superior performance compared to sequential decision trees. © 2016 IEEE.
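The abstract does not spell out the split rule, but Hoeffding trees in general decide when a leaf has seen enough examples by comparing the two best split candidates against the Hoeffding bound. A minimal sketch of that standard bound follows; the function names and the numbers in the usage note are illustrative assumptions, not taken from the VHT paper or from Apache SAMOA.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the observed mean
    of n samples of a variable with the given range is within epsilon of the
    true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the gap between the two best candidate attributes exceeds
    the bound, i.e. the best attribute is unlikely to change with more data."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

With a range of 1, delta = 1e-7, and 1000 examples at a leaf, the bound is roughly 0.09, so a gain gap of 0.15 would trigger a split while a gap of 0.05 would not.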
Banerjee S.,Pennsylvania State University |
Mitra P.,QCRI |
Sugiyama K.,National University of Singapore
IJCAI International Joint Conference on Artificial Intelligence | Year: 2015
Abstractive summarization is an ideal form of summarization since it can synthesize information from multiple documents to create concise informative summaries. In this work, we aim at developing an abstractive summarizer. First, our proposed approach identifies the most important document in the multi-document set. The sentences in the most important document are aligned to sentences in other documents to generate clusters of similar sentences. Second, we generate K-shortest paths from the sentences in each cluster using a word-graph structure. Finally, we select sentences from the set of shortest paths generated from all the clusters employing a novel integer linear programming (ILP) model with the objective of maximizing information content and readability of the final summary. Our ILP model represents the shortest paths as binary variables and considers the length of the path, information score and linguistic quality score in the objective function. Experimental results on the DUC 2004 and 2005 multi-document summarization datasets show that our proposed approach outperforms all the baselines and state-of-the-art extractive summarizers as measured by the ROUGE scores. Our method also outperforms a recent abstractive summarization technique. In manual evaluation, our approach also achieves promising results on informativeness and readability.
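The word-graph step can be illustrated with a much simplified sketch. The assumptions here are loud ones: identical lowercase words are merged into one node, edges are unweighted, and only a single shortest path is extracted by breadth-first search, whereas the paper generates K shortest paths and then selects among them with an ILP. All names below are hypothetical.

```python
from collections import defaultdict, deque


def build_word_graph(sentences):
    """Merge the sentences of one cluster into a directed graph of adjacent words."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            graph[current].add(following)
    return graph


def shortest_path(graph, start="<s>", end="</s>"):
    """Return the words on one shortest start-to-end path, or None if none exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path[1:-1]  # drop the sentence boundary markers
        for following in sorted(graph[path[-1]]):
            if following not in seen:
                seen.add(following)
                queue.append(path + [following])
    return None


cluster = ["storms hit the coast on monday", "storms hit florida"]
print(" ".join(shortest_path(build_word_graph(cluster))))  # storms hit florida
```

A shortest path alone favors brevity over content, which is presumably why the selection step in the abstract also scores information content and linguistic quality.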
Park K.,KAIST |
Weber I.,QCRI |
Cha M.,My Fitness Pal United States |
Proceedings of the ACM Conference on Computer Supported Cooperative Work, CSCW | Year: 2016
As the world becomes more digitized and interconnected, information that was once considered to be private such as one's health status is now being shared publicly. To understand this new phenomenon better, it is crucial to study what types of health information are being shared on social media and why, as well as by whom. In this paper, we study the traits of users who share their personal health and fitness-related information on social media by analyzing fitness status updates that MyFitnessPal users have shared via Twitter. We investigate how certain features like user profile, fitness activity, and fitness network in social media can potentially impact the long-term engagement of fitness app users. We also discuss implications of our findings for achieving better retention of these users and promoting more sharing of their status updates. © 2016 ACM.
Fan W.,University of Edinburgh |
Fan W.,Beihang University |
Geerts F.,University of Antwerp |
Tang N.,QCRI |
Yu W.,University of Edinburgh
Proceedings - International Conference on Data Engineering | Year: 2013
This paper introduces a new approach for conflict resolution: given a set of tuples pertaining to the same entity, the goal is to identify a single tuple in which each attribute takes the most current, consistent value in the set. This problem is important in data integration, data cleaning and query answering. It is, however, challenging since in practice, reliable timestamps are often absent, among other things. We propose a model for conflict resolution, by specifying data currency in terms of partial currency orders and currency constraints, and by enforcing data consistency with constant conditional functional dependencies. We show that identifying data currency orders helps us repair inconsistent data, and vice versa. We investigate a number of fundamental problems associated with conflict resolution, and establish their complexity. In addition, we introduce a framework and develop algorithms for conflict resolution, by integrating data currency and consistency inferences into a single process, and by interacting with users. We experimentally verify the accuracy and efficiency of our methods using real-life and synthetic data. © 2013 IEEE.
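To make the idea of resolving conflicts from partial currency orders concrete, here is a minimal sketch. It assumes each attribute comes with a set of (older, newer) value pairs that is already transitively closed, and it ignores conditional functional dependencies and user interaction entirely. The attribute names, values, and function names are hypothetical, not the paper's model or algorithms.

```python
def latest_value(values, order):
    """Return the one value not known to be older than another value present,
    or None when the partial order leaves the answer ambiguous."""
    present = set(values)
    maximal = [v for v in present if not any((v, w) in order for w in present)]
    return maximal[0] if len(maximal) == 1 else None


def resolve(tuples, orders):
    """Build a single tuple holding the most current value of each attribute."""
    attributes = tuples[0].keys()
    return {
        attr: latest_value([t[attr] for t in tuples], orders.get(attr, set()))
        for attr in attributes
    }


records = [
    {"status": "single", "city": "Paris"},
    {"status": "married", "city": "Paris"},
]
orders = {"status": {("single", "married")}}  # "single" is older than "married"
print(resolve(records, orders))  # {'status': 'married', 'city': 'Paris'}
```

When two different values appear with no order between them, the sketch returns None for that attribute, which is the kind of case where the abstract's consistency inference or user interaction would be needed.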
News Article | November 4, 2015
The converted video can be played back over any 3-D device—a commercial 3-D TV, Google's new Cardboard system, which turns smartphones into 3-D displays, or special-purpose displays such as Oculus Rift. The researchers presented the new system last week at the Association for Computing Machinery's Multimedia conference. "Any TV these days is capable of 3-D," says Wojciech Matusik, an associate professor of electrical engineering and computer science at MIT and one of the system's co-developers. "There's just no content. So we see that the production of high-quality content is the main thing that should happen. But sports is very hard. With movies, you have artists who paint the depth map. Here, there is no luxury of hiring 100 artists to do the conversion. This has to happen in real-time." The system is one result of a collaboration between QCRI and MIT's Computer Science and Artificial Intelligence Laboratory. Joining Matusik on the conference paper are Kiana Calagari, a research associate at QCRI and first author; Alexandre Kaspar, an MIT graduate student in electrical engineering and computer science; Piotr Didyk, who was a postdoc in Matusik's group and is now a researcher at the Max Planck Institute for Informatics; Mohamed Hefeeda, a principal scientist at QCRI; and Mohamed Elgharib, a QCRI postdoc. QCRI also helped fund the project. In the past, researchers have tried to develop general-purpose systems for converting 2-D video to 3-D, but they haven't worked very well and have tended to produce odd visual artifacts that detract from the viewing experience. "Our advantage is that we can develop it for a very specific problem domain," Matusik says. "We are developing a conversion pipeline for a specific sport. We would like to do it at broadcast quality, and we would like to do it in real-time. What we have noticed is that we can leverage video games." 
Today's video games generally store very detailed 3-D maps of the virtual environment that the player is navigating. When the player initiates a move, the game adjusts the map accordingly and, on the fly, generates a 2-D projection of the 3-D scene that corresponds to a particular viewing angle. The MIT and QCRI researchers essentially ran this process in reverse. They set the very realistic soccer game "FIFA13" to play over and over again, and used Microsoft's video-game analysis tool PIX to continuously store screen shots of the action. For each screen shot, they also extracted the corresponding 3-D map. Using a standard algorithm for gauging the difference between two images, they winnowed out most of the screen shots, keeping just those that best captured the range of possible viewing angles and player configurations that the game presented; the total number of screen shots still ran to the tens of thousands. Then they stored each screen shot and the associated 3-D map in a database. For every frame of 2-D video of an actual soccer game, the system looks for the 10 or so screen shots in the database that best correspond to it. Then it decomposes all those images, looking for the best matches between smaller regions of the video feed and smaller regions of the screen shots. Once it's found those matches, it superimposes the depth information from the screen shots on the corresponding sections of the video feed. Finally, it stitches the pieces back together. The result is a very convincing 3-D effect, with no visual artifacts. The researchers conducted a user study in which the majority of subjects gave the 3-D effect a rating of 5 ("excellent") on a five-point ("bad" to "excellent") scale; the average score was between 4 ("good") and 5. Currently, the researchers say, the system takes about a third of a second to process a frame of video. 
But successive frames could all be processed in parallel, so that the third-of-a-second delay needs to be incurred only once. A broadcast delay of a second or two would probably provide an adequate buffer to permit conversion on the fly. Even so, the researchers are working to bring the conversion time down still further. "This is a clever use of game content, which leads to better results and easier acquisition of large and diverse reference data," says Hanspeter Pfister, a professor of computer science at Harvard University. "One of the main insights of the paper is that domain-specific methods are able to yield bigger improvements than more general approaches. This is an important lesson that will have ramifications for other domains." More information: Gradient-based 2D-to-3D Conversion for Soccer Videos. MM '15 Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. Pages 331-340 DOI: 10.1145/2733373.2806262
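The lookup-and-transfer procedure the article describes (find the closest screen shots, match regions, copy depth, stitch) can be sketched in a few lines. This is a toy version under stated assumptions: images are flat lists of grayscale values, similarity is a plain sum of squared differences, and each region is compared only against the same position in the candidates. The actual system is gradient-based, according to the paper's title, and its matching is more elaborate. All function names and parameters are hypothetical.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_screenshots(frame, database, k):
    """Return the k (image, depth_map) entries whose image is closest to the frame."""
    return sorted(database, key=lambda entry: ssd(frame, entry[0]))[:k]


def transfer_depth(frame, database, k=10, block=4):
    """Estimate a depth map for the frame, one block of pixels at a time."""
    candidates = nearest_screenshots(frame, database, k)
    depth = []
    for start in range(0, len(frame), block):
        end = start + block
        # Choose the candidate whose region at this position matches best.
        _, best_depth = min(
            candidates,
            key=lambda entry: ssd(frame[start:end], entry[0][start:end]),
        )
        # "Stitching" here is simply concatenating the copied depth blocks.
        depth.extend(best_depth[start:end])
    return depth


database = [
    ([0, 0, 0, 0, 9, 9, 9, 9], [1] * 8),
    ([5, 5, 5, 5, 1, 1, 1, 1], [2] * 8),
]
frame = [0, 0, 0, 0, 1, 1, 1, 1]
print(transfer_depth(frame, database, k=2, block=4))  # [1, 1, 1, 1, 2, 2, 2, 2]
```

Because each frame is handled independently in this sketch, frames could be dispatched to separate workers, which matches the article's point that the per-frame delay need only be paid once.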
News Article | November 4, 2015
The finesse dribbling, pinpoint passes, powerful kicks, saves at the net and diehard fans are some of the facets that make soccer the world's most-popular sport — one that more than warrants 3D viewing to fully bring those thrilling plays to life. MIT seemed to think so, too, working with the Qatar Computing Research Institute (QCRI) to develop a system that automatically converts 2D video of soccer games into 3D. They debuted the system at the Association for Computing Machinery's Multimedia conference last week. According to MIT News, by exploiting video-game software, MIT and QCRI combined to broadcast 3D video of soccer games in real time. MIT researchers say once the video is converted, it can be played back over any 3D device — whether Google's Cardboard, an Oculus Rift headset or a commercial 3D television. About the latter, though, just because a TV supports 3D viewing, doesn't mean that there's available content for it. That gave MIT and QCRI extra incentive to convert 2D video of soccer games into 3D. "Any TV these days is capable of 3D," Wojciech Matusik, an MIT associate professor of electrical engineering and computer science and a system co-developer, told MIT News. "There's just no content. So we see that the production of high-quality content is the main thing that should happen. But sports is very hard. With movies, you have artists who paint the depth map. Here, there is no luxury of hiring 100 artists to do the conversion. This has to happen in real-time." The automatic 2D to 3D video conversion is pretty impressive to see. Just imagine how the World Cup would look in 3D and what that kind of viewing experience would be like for fans around the globe.
Chu X.,University of Waterloo |
Ilyas I.F.,QCRI |
Proceedings of the VLDB Endowment | Year: 2013
Integrity constraints (ICs) provide a valuable tool for enforcing correct application semantics. However, designing ICs requires experts and time. Proposals for automatic discovery have been made for some formalisms, such as functional dependencies and their extension conditional functional dependencies. Unfortunately, these dependencies cannot express many common business rules. For example, an American citizen cannot have a lower salary and a higher tax rate than another citizen in the same state. In this paper, we tackle the challenges of discovering dependencies in a more expressive integrity constraint language, namely Denial Constraints (DCs). DCs are expressive enough to overcome the limits of previous languages and, at the same time, have enough structure to allow efficient discovery and application in several scenarios. We lay out theoretical and practical foundations for DCs, including a set of sound inference rules and a linear algorithm for implication testing. We then develop an efficient instance-driven DC discovery algorithm and propose a novel scoring function to rank DCs for user validation. Using real-world and synthetic datasets, we experimentally evaluate scalability and effectiveness of our solution. © 2013 VLDB Endowment.
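The salary and tax-rate rule in the abstract is a denial constraint over pairs of tuples: no two tuples t1, t2 may satisfy t1.state = t2.state, t1.salary < t2.salary, and t1.tax_rate > t2.tax_rate all at once. The sketch below only checks that one constraint against a small table by brute force over all ordered pairs; it is not the discovery algorithm the paper proposes. The field names and sample rows are assumptions made for illustration.

```python
from itertools import permutations


def violations(rows):
    """Return ordered index pairs (i, j) where row i has the same state,
    a lower salary, and a higher tax rate than row j."""
    return [
        (i, j)
        for (i, a), (j, b) in permutations(enumerate(rows), 2)
        if a["state"] == b["state"]
        and a["salary"] < b["salary"]
        and a["tax_rate"] > b["tax_rate"]
    ]


rows = [
    {"state": "NY", "salary": 50000, "tax_rate": 0.25},
    {"state": "NY", "salary": 80000, "tax_rate": 0.20},
    {"state": "CA", "salary": 40000, "tax_rate": 0.30},
]
print(violations(rows))  # [(0, 1)]
```

The check is quadratic in the number of rows, which is acceptable for a demonstration but hints at why efficient discovery and validation are nontrivial on real datasets.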
Abbar S.,QCRI |
Amer-Yahia S.,French National Center for Scientific Research |
Indyk P.,Massachusetts Institute of Technology |
Mahabadi S.,Massachusetts Institute of Technology
WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web | Year: 2013
News articles typically drive a lot of traffic in the form of comments posted by users on a news site. Such user-generated content tends to carry additional information such as entities and sentiment. In general, when articles are recommended to users, only popularity (e.g., most shared and most commented), recency, and sometimes (manual) editors' picks (based on daily hot topics), are considered. We formalize a novel recommendation problem where the goal is to find the closest, most diverse articles to the one the user is currently browsing. Our diversity measure incorporates entities and sentiment extracted from comments. Given the real-time nature of our recommendations, we explore the applicability of nearest neighbor algorithms to solve the problem. Our user study on real opinion articles from aljazeera.net and reuters.com validates the use of entities and sentiment extracted from articles and their comments to achieve news diversity when compared to content-based diversity. Finally, our performance experiments show the real-time feasibility of our solution. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
Kwak H.,QCRI |
Blackburn J.,Telefonica |
Han S.,University of Washington
Conference on Human Factors in Computing Systems - Proceedings | Year: 2015
In this work we explore cyberbullying and other toxic behavior in team competition online games. Using a dataset of over 10 million player reports on 1.46 million toxic players along with corresponding crowdsourced decisions, we test several hypotheses drawn from theories explaining toxic behavior. Besides providing a large-scale, empirically based understanding of toxic behavior, our work can be used as a basis for building systems to detect, prevent, and counteract toxic behavior. © Copyright 2015 ACM.
Jindal A.,Massachusetts Institute of Technology |
Quiane-Ruiz J.,QCRI |
Madden S.,Massachusetts Institute of Technology
Proceedings of the ACM SIGMOD International Conference on Management of Data | Year: 2013
Modern enterprises have to deal with a variety of analytical queries over very large datasets. In this respect, Hadoop has gained much popularity since it scales to thousands of nodes and terabytes of data. However, Hadoop suffers from poor performance, especially in I/O. Several works have proposed alternate data storage for Hadoop in order to improve the query performance. However, many of these works end up making deep changes in Hadoop or HDFS. As a result, they are (i) difficult for many users to adopt, and (ii) not compatible with future Hadoop releases. In this paper, we present CARTILAGE, a comprehensive data storage framework built on top of HDFS. CARTILAGE allows users full control over their data storage, including data partitioning, data replication, data layouts, and data placement. Furthermore, CARTILAGE can be layered on top of an existing HDFS installation. This means that Hadoop, as well as other query engines, can readily make use of CARTILAGE. We describe several use-cases of CARTILAGE and propose to demonstrate the flexibility and efficiency of CARTILAGE through a set of novel scenarios. Copyright © 2013 ACM.