Fast Algorithm for Embedded Order Dependency Validation (Extended Version)

Alejandro Ramos, Takuya Uemura, Daichi Amagata, Ryo Shirai, Takahiro Hara

TL;DR

This paper introduces the novel notion of Embedded ODs (eODs) to deal with missing values, and proposes an efficient heuristic algorithm for validating embedded ODs.

Abstract

Order Dependencies (ODs) have many applications, such as query optimization, data integration, and data cleaning. Although many works have addressed the problem of discovering ODs (and their variants), they do not consider datasets with missing values, which are common in real-world data. This paper introduces the novel notion of Embedded ODs (eODs) to deal with missing values. The intuition of eODs is to confirm ODs only on tuples with no missing values on a given embedding (a set of attributes). In this paper, we address the problem of validating a given eOD. If the eOD holds, we return true. Otherwise, we search for an updated embedding such that the updated eOD holds. If no such embedding exists, we return false. A natural requirement is to choose an embedding that minimizes the number of ignored tuples. We show that computing such an embedding is NP-complete. We therefore propose an efficient heuristic algorithm for validating eODs. We conduct experiments on real-world datasets, and the results confirm the efficiency of our algorithm.
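
To make the embedding idea concrete, the following minimal sketch keeps only the tuples with no missing values on a given embedding and counts how many tuples are ignored. The function names `embedded_relation` and `ignored_count`, the use of `None` to mark a missing value, and the toy relation are illustrative assumptions, not the paper's implementation.

```python
def embedded_relation(rows, embedding):
    """Keep only tuples with no missing value (None) on the embedding attributes."""
    return [r for r in rows if all(r[a] is not None for a in embedding)]


def ignored_count(rows, embedding):
    """Number of tuples dropped because they miss a value on the embedding."""
    return len(rows) - len(embedded_relation(rows, embedding))


# Toy relation (hypothetical data); None marks a missing value.
rows = [
    {"A": 1, "B": 10, "C": "x"},
    {"A": 2, "B": None, "C": "y"},
    {"A": 3, "B": 30, "C": None},
]

print(ignored_count(rows, ["A", "B"]))       # 1
print(ignored_count(rows, ["A", "B", "C"]))  # 2
```

In this toy relation, a larger embedding ignores more tuples, which is why the choice of embedding matters when the goal is to ignore as few tuples as possible.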

Paper Structure (15 sections, 5 theorems, 2 figures, 2 tables, 1 algorithm)

This paper contains 15 sections, 5 theorems, 2 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

$\mathbf{E}: A \mapsto_{\leq} B$ is valid, iff there is neither a split nor a swap on attributes $A,B$ in $\mathbf{r}^{\mathbf{E}}$.
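
Lemma 1 can be illustrated with a brute-force check over pairs of tuples. The sketch below assumes the usual readings of the two violation types, since Definitions 3 and 5 are not reproduced here: a split is a pair of tuples that agree on $A$ but differ on $B$, and a swap is a pair ordered one way on $A$ and the opposite way on $B$. It also assumes the input already is $\mathbf{r}^{\mathbf{E}}$, so no value on $A$ or $B$ is missing. The name `find_violation` is hypothetical, and this quadratic pairwise scan is for illustration only; it is not the paper's algorithm.

```python
from itertools import combinations


def find_violation(rows_e, lhs, rhs):
    """Return "split", "swap", or None for the candidate OD lhs -> rhs.

    rows_e is assumed to be the embedded relation r^E: every tuple has
    non-missing values on all attributes in lhs and rhs.
    Attribute lists are compared lexicographically via Python tuples.
    """
    for s, t in combinations(rows_e, 2):
        sa, ta = tuple(s[a] for a in lhs), tuple(t[a] for a in lhs)
        sb, tb = tuple(s[b] for b in rhs), tuple(t[b] for b in rhs)
        if sa == ta and sb != tb:
            return "split"  # equal on A, different on B
        if (sa < ta and sb > tb) or (ta < sa and tb > sb):
            return "swap"   # order on A is reversed on B
    return None  # neither a split nor a swap: the eOD is valid


valid = [{"A": 1, "B": 10}, {"A": 2, "B": 20}, {"A": 2, "B": 20}]
swapped = [{"A": 1, "B": 20}, {"A": 2, "B": 10}]
split = [{"A": 1, "B": 10}, {"A": 1, "B": 11}]

print(find_violation(valid, ["A"], ["B"]))
print(find_violation(swapped, ["A"], ["B"]))
print(find_violation(split, ["A"], ["B"]))
```

Under these assumptions, a return value of `None` corresponds to the "valid" side of the lemma.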

Figures (2)

  • Figure 1: Running time vs. LHS and RHS sizes
  • Figure 2: Average $|\mathbf{S}|$ and $|\mathbf{M}|$ vs. LHS and RHS sizes

Theorems & Definitions (12)

  • Definition 1: Order dependencies (Szlichta et al., 2012)
  • Definition 2: Embedded order dependencies
  • Example 1
  • Definition 3: Split
  • Definition 4: Merge
  • Definition 5: Swap
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1
  • ...and 2 more