Konda P.,University of Wisconsin - Madison |
Das S.,University of Wisconsin - Madison |
Paul Suganthan G.C.,University of Wisconsin - Madison |
Doan A.,University of Wisconsin - Madison |
And 10 more authors.
Proceedings of the VLDB Endowment | Year: 2016
Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then present as a solution Magellan, a new kind of EM systems. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users do these steps; the tools seek to cover the entire EM pipeline, not just matching and blocking as current EM systems do. (3) Tools are built on top of the data analysis and Big Data stacks in Python, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges raised by Magellan, then present extensive experiments with 44 students and users at several organizations that show the promise of the Magellan approach. © 2016 VLDB Endowment 2150-8097/16/08.
Lam W.,@WalmartLabs |
Liu L.,@WalmartLabs |
Prasad S.,@WalmartLabs |
Rajaraman A.,@WalmartLabs |
And 3 more authors.
Proceedings of the VLDB Endowment | Year: 2012
MapReduce has emerged as a popular method to process big data. In the past few years, however, not just big data, but fast data has also exploded in volume and availability. Examples of such data include sensor data streams, the Twitter Firehose, and Facebook updates. Numerous applications must process fast data. Can we provide a MapReduce-style framework so that developers can quickly write such applications and execute them over a cluster of machines, to achieve low latency and high scalability? In this paper we report on our investigation of this question, as carried out at Kosmix and WalmartLabs. We describe MapUpdate, a framework like MapReduce, but specifically developed for fast data. We describe Muppet, our implementation of MapUpdate. Throughout the description we highlight the key challenges, argue why MapReduce is not well suited to address them, and briefly describe our current solutions. Finally, we describe our experience and lessons learned with Muppet, which has been used extensively at Kosmix and WalmartLabs to power a broad range of applications in social media and e-commerce. © 2012 VLDB Endowment.
Gattani A.,@WalmartLabs |
Lamba D.S.,@WalmartLabs |
Garera N.,@WalmartLabs |
Chai X.,@WalmartLabs |
And 6 more authors.
Proceedings of the VLDB Endowment | Year: 2013
Many applications that process social data, such as tweets, must extract entities from tweets (e.g., "Obama" and "Hawaii" in "Obama went to Hawaii"), link them to entities in a knowledge base (e.g., Wikipedia), classify tweets into a set of predefined topics, and assign descriptive tags to tweets. Few solutions exist today to solve these problems for social data, and they are limited in important ways. Further, even though several industrial systems such as OpenCalais have been deployed to solve these problems for text data, little if any has been published about them, and it is unclear if any of the systems has been tailored for social media. In this paper we describe in depth an end-to-end industrial system that solves these problems for social data. The system has been developed and used heavily in the past three years, first at Kosmix, a startup, and later at WalmartLabs. We show how our system uses a Wikipedia-based global "real-time" knowledge base that is well suited for social data, how we interleave the tasks in a synergistic fashion, how we generate and use contexts and social signals to improve task accuracy, and how we scale the system to the entire Twitter firehose. We describe experiments that show that our system outperforms current approaches. Finally we describe applications of the system at Kosmix and WalmartLabs, and lessons learned. © 2013 VLDB.