Papenbrock T.,Hasso Plattner Institute HPI | Kruse S.,Hasso Plattner Institute HPI | Quiane-Ruiz J.-A.,Qatar Computing Research Institute QCRI | Naumann F.,Hasso Plattner Institute HPI
Proceedings of the VLDB Endowment | Year: 2015

The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets become larger in terms of number of tuples as well as attributes. To this end, we propose Binder, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever-increasing size of today's data. In contrast to most related works, we do not rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders Binder an efficient and scalable competitor. Our exhaustive experimental evaluation shows the high superiority of Binder over the state-of-the-art in both unary (Spider) and n-ary (Mind) IND discovery. Binder is up to 26x faster than Spider and more than 2500x faster than Mind. © 2015 VLDB Endowment 21508097/15/03.
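To make the notion of a unary IND concrete, here is a minimal sketch of the definition only: column A is included in column B if every value of A also occurs in B. This is a naive in-memory check, not Binder's divide & conquer method; the function name `unary_inds` and the sample columns are hypothetical.

```python
from itertools import permutations


def unary_inds(columns):
    """Return all pairs (dep, ref) where every value of `dep` occurs in `ref`.

    `columns` maps a column name to its list of values. This compares every
    ordered pair of columns, so it only suits small in-memory data.
    """
    value_sets = {name: set(values) for name, values in columns.items()}
    return sorted(
        (dep, ref)
        for dep, ref in permutations(value_sets, 2)
        if value_sets[dep] <= value_sets[ref]  # subset test = inclusion
    )


# Hypothetical sample data: a foreign-key-like column and two other columns.
columns = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.age": [30, 41, 25, 30],
}
inds = unary_inds(columns)
print(inds)
```

The pairwise comparison grows quadratically with the number of columns and needs all values in memory, which is the scaling problem the abstract refers to.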

Kokash N.,Centrum Wiskunde and Informatica CWI | Krause C.,Hasso Plattner Institute HPI | De Vink E.,TU Eindhoven
Formal Aspects of Computing | Year: 2012

The paradigm of service-oriented computing revolutionized the field of software engineering. According to this paradigm, new systems are composed of existing stand-alone services to support complex cross-organizational business processes. Correct communication of these services is not possible without a proper coordination mechanism. The Reo coordination language is a channel-based modeling language that introduces various types of channels and their composition rules. By composing Reo channels, one can specify Reo connectors that realize arbitrarily complex behavioral protocols. Several formalisms have been introduced to give semantics to Reo. In their most basic form, they reflect service synchronization and dataflow constraints imposed by connectors. To ensure that the composed system behaves as intended, we need a wide range of automated verification tools to assist service composition designers. In this paper, we present our framework for the verification of Reo using the mCRL2 toolset. We unify our previous work on mapping various semantic models for Reo, namely, constraint automata, timed constraint automata, coloring semantics and the newly developed action constraint automata, to the process algebraic specification language of mCRL2, address the correctness of this mapping, discuss tool support, and present a detailed example that illustrates the use of Reo empowered with mCRL2 for the analysis of dataflow in service-based process models. © 2011 BCS.

Abedjan Z.,Hasso Plattner Institute HPI | Quiane-Ruiz J.-A.,Qatar Computing Research Institute QCRI | Naumann F.,Hasso Plattner Institute HPI
Proceedings - International Conference on Data Engineering | Year: 2014

The discovery of all unique (and non-unique) column combinations in an unknown dataset is at the core of any data profiling effort. Unique column combinations resemble candidate keys of a relational dataset. Several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are suitable for applications on dynamic datasets, such as transactional databases, social networks, and scientific applications. In these cases, data profiling techniques should be able to efficiently discover new uniques and non-uniques (and validate old ones) after tuple inserts or deletes, without re-profiling the entire dataset. We present the first approach to efficiently discover unique and non-unique constraints on dynamic datasets that is independent of the initial dataset size. In particular, Swan makes use of intelligently chosen indices to minimize access to old data. We perform an exhaustive analysis of Swan and compare it with two state-of-the-art techniques for unique discovery: Gordian and Ducc. The results show that Swan significantly outperforms both, as well as their incremental adaptations. For inserts, Swan is more than 63x faster than Gordian and up to 50x faster than Ducc. For deletes, Swan is more than 15x faster than Gordian and up to 1 order of magnitude faster than Ducc. In fact, Swan even improves on the static case by dividing the dataset into a static part and a set of inserts. © 2014 IEEE.

Kruse S.,Hasso Plattner Institute HPI | Jentzsch A.,Hasso Plattner Institute HPI | Papenbrock T.,Hasso Plattner Institute HPI | Kaoudi Z.,Qatar Computing Research Institute QCRI | And 2 more authors.
Proceedings of the ACM SIGMOD International Conference on Management of Data | Year: 2016

Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow these capabilities to be transferred to RDF data. However, CIND discovery is computationally much more complex than IND discovery and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before. © 2016 ACM.

Kruse S.,Hasso Plattner Institute HPI | Papotti P.,Qatar Computing Research Institute QCRI | Naumann F.,Hasso Plattner Institute HPI
EDBT 2015 - 18th International Conference on Extending Database Technology, Proceedings | Year: 2015

Data cleaning and data integration have been the topic of intensive research for at least the past thirty years, resulting in a multitude of specialized methods and integrated tool suites. All of them require at least some and in most cases significant human input in their configuration, during processing, and for evaluation. For managers (and for developers and scientists) it would therefore be of great value to be able to estimate the effort of cleaning and integrating some given data sets and to know the pitfalls of such an integration project in advance. This helps in deciding about an integration project using cost/benefit analysis, budgeting a team with funds and manpower, and monitoring its progress. Further, knowledge of how well a data source fits into a given data ecosystem improves source selection. We present an extensible framework for the automatic effort estimation for mapping and cleaning activities in data integration projects with multiple sources. It comprises a set of measures and methods for estimating integration complexity and ultimately effort, taking into account heterogeneities of both schemas and instances and regarding both integration and cleaning operations. Experiments on two real-world scenarios show that our proposal is two to four times more accurate than a current approach in estimating the time duration of an integration process, and provides a meaningful breakdown of the integration problems as well as the required integration activities. © 2015, Copyright is with the authors.

Heise A.,Hasso Plattner Institute HPI | Quiane-Ruiz J.,Qatar Computing Research Institute QCRI | Abedjan Z.,Hasso Plattner Institute HPI | Jentzsch A.,Hasso Plattner Institute HPI | Naumann F.,Hasso Plattner Institute HPI
Proceedings of the VLDB Endowment | Year: 2013

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out. © 2013 VLDB Endowment 21508097/13/12.
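As an illustration of why the search space is exponential, the sketch below enumerates column combinations level by level and reports the minimal unique ones. It is a brute-force baseline with simple superset pruning, not Ducc's depth-first and random walk traversal; `is_unique`, `minimal_uniques`, and the sample rows are hypothetical.

```python
from itertools import combinations


def is_unique(rows, cols):
    """True if no two rows agree on all columns in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, n_cols):
    """Enumerate the lattice bottom-up and collect minimal unique combinations."""
    found = []
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # Any superset of a known unique is unique but not minimal: skip it.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Hypothetical sample relation: (name, city, year).
rows = [
    ("alice", "berlin", 1990),
    ("alice", "potsdam", 1985),
    ("bob", "berlin", 1985),
]
uniques = minimal_uniques(rows, 3)
print(uniques)
```

With n columns the lattice holds 2^n - 1 combinations, so even this pruned enumeration becomes impractical at the hundreds of attributes mentioned in the abstract.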

Dong X.L.,AT and T Labs Research | Naumann F.,Hasso Plattner Institute HPI
SIGMOD Record | Year: 2010

WebDB 2010, the 13th International Workshop on the Web and Databases, took place on June 6, 2010. Christian Bizer, cofounder of the DBpedia project, compared the Linked Data movement, which stems from the Semantic Web research area, with research in the field of Dataspaces. The research session entitled Linked data and Wikipedia featured papers entitled 'An agglomerative query model for discovery in linked data: semantics and approach' and 'XML-based RDF data management for efficient query processing'. The other sessions of the workshop included papers entitled 'Find your advisor: robust knowledge gathering from the Web', 'Redundancy-driven web data extraction and integration', and 'Using latent-structure to detect objects on the Web'. Topics such as 'Manimal: relational optimization for data-intensive programs' and 'Learning topical transition probabilities in click through data with regression models' were also discussed.

Papenbrock T.,Hasso Plattner Institute HPI | Naumann F.,Hasso Plattner Institute HPI
Proceedings of the ACM SIGMOD International Conference on Management of Data | Year: 2016

Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually unknown and almost impossible to discover manually. For this reason, database research has proposed various algorithms for functional dependency discovery. None, however, are able to process datasets of typical real-world size, e.g., datasets with more than 50 attributes and a million records. We present a hybrid discovery algorithm called HyFD, which combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets. © 2016 ACM.
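For readers unfamiliar with the term, a functional dependency X -> A holds when any two records that agree on the columns X also agree on column A. The sketch below validates a single candidate in one pass; it illustrates the definition only and is not the HyFD algorithm. The name `holds` and the sample records are hypothetical.

```python
def holds(rows, lhs, rhs):
    """Check the candidate functional dependency lhs -> rhs on `rows`.

    `lhs` is a list of column indexes and `rhs` a single column index.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = row[rhs]
        # setdefault stores the first rhs value seen for this lhs key;
        # a later, different value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample records: (zip, city, name).
rows = [
    ("14482", "Potsdam", "Anna"),
    ("14482", "Potsdam", "Ben"),
    ("10115", "Berlin", "Anna"),
]
zip_determines_city = holds(rows, [0], 1)
name_determines_city = holds(rows, [2], 1)
print(zip_determines_city, name_determines_city)
```

Validating one candidate is cheap; the difficulty described in the abstract comes from the number of candidates, which grows exponentially with the number of attributes.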

Abedjan Z.,Hasso Plattner Institute HPI | Gruetze T.,Hasso Plattner Institute HPI | Jentzsch A.,Hasso Plattner Institute HPI | Naumann F.,Hasso Plattner Institute HPI
Proceedings - International Conference on Data Engineering | Year: 2014

Before reaping the benefits of open data to add value to an organization's internal data, such new, external datasets must be analyzed and understood already at the basic level of data types, constraints, value patterns, etc. Such data profiling, already difficult for large relational data sources, is even more challenging for RDF datasets, the preferred data model for linked open data. We present ProLod++, a novel tool for various profiling and mining tasks to understand and ultimately improve open RDF data. ProLod++ comprises various traditional data profiling tasks, adapted to the RDF data model. In addition, it features many specific profiling results for open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and uniqueness discovery along ontology hierarchies. ProLod++ is highly efficient, allowing interactive profiling for users interested in exploring the properties and structure of yet unknown datasets. © 2014 IEEE.

Jentzsch A.,Hasso Plattner Institute HPI | Muhleisen H.,Centrum Wiskunde and Informatica CWI | Naumann F.,Hasso Plattner Institute HPI
CEUR Workshop Proceedings | Year: 2015

The Web of Data contains a large number of openly-available datasets covering a wide variety of topics. In order to benefit from this massive amount of open data, e.g., to add value to an organization's internal data, such external datasets must be analyzed and understood already at the basic level of data types, uniqueness, constraints, value patterns, etc. For Linked Datasets and other Web data such meta information is currently quite limited or not available at all. Data profiling techniques are needed to compute respective statistics and meta information. Analyzing datasets along the vocabulary-defined taxonomic hierarchies yields further insights, such as the data distribution at different hierarchy levels, or possible mappings between vocabularies or datasets. In particular, key candidates for entities are difficult to find in light of the sparsity of property values on the Web of Data. To this end we introduce the concept of keyness and perform a comprehensive analysis of its expressiveness on multiple datasets.
