Query-Guided Analysis and Mitigation of Data Verification Errors (Extended Version)

Ran Schreiber; Yael Amsterdamer

TL;DR

This work introduces Maximal Error Score (MES), a worst-case uncertainty metric that quantifies the reliability of query output tuples independently of the underlying data distribution, and develops efficient algorithms for computing MES and detecting risky tuples, as well as a generic algorithm that builds on both indicators and interacts with external verifiers to select effective additional verification steps.

Abstract

Data verification, the process of labeling data items as correct or incorrect, is a preprocessing step that may critically affect the quality of results in data-driven pipelines. Despite recent advances, verification can still produce erroneous labels that propagate to downstream query results in complex ways. We present a framework that complements existing verification tools by assessing the impact of potential labeling errors on query outputs and guiding additional verification steps to improve result reliability. To this end, we introduce the Maximal Error Score (MES), a worst-case uncertainty metric that quantifies the reliability of query output tuples independently of the underlying data distribution. As an auxiliary indicator, we identify risky tuples, i.e., input tuples for which reducing label uncertainty may counterintuitively increase the output uncertainty. We then develop efficient algorithms for computing MES and detecting risky tuples, as well as a generic algorithm, named MESReduce, that builds on both indicators and interacts with external verifiers to select effective additional verification steps. We implement our techniques in a prototype system and evaluate them on real and synthetic datasets, demonstrating that MESReduce can substantially reduce the MES and improve the accuracy of verification results.
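To make the notion of label errors propagating to query outputs concrete, the following is a minimal illustrative sketch and not the paper's MES definition or algorithm. It assumes that an output tuple's provenance is given as a DNF over input tuple identifiers (as is standard for SPJU queries), that each input label is wrong independently with a known probability, and it enumerates all possible worlds to compute the probability that the output label derived from the input labels is wrong. The function name `output_error_probability`, the tuple ids, and the error probabilities are hypothetical. MES itself is described as a worst-case, distribution-independent metric, so it is a different quantity from the fixed-probability value computed here.

```python
from itertools import product


def output_error_probability(provenance, labels, error_prob):
    """Probability that the output label derived from input labels is wrong.

    provenance: list of monomials, each a list of input tuple ids (DNF).
    labels: dict mapping tuple id -> bool, the verifier's label.
    error_prob: dict mapping tuple id -> probability that the label is wrong
                (assumed independent across tuples; an assumption of this sketch).
    """
    def evaluate(world):
        # The output is correct iff some derivation uses only correct inputs.
        return any(all(world[t] for t in mono) for mono in provenance)

    labeled_output = evaluate(labels)
    ids = sorted(labels)
    total = 0.0
    # Enumerate every possible world: each input label is either right or flipped.
    for flips in product([False, True], repeat=len(ids)):
        world = {t: (labels[t] != flipped) for t, flipped in zip(ids, flips)}
        p = 1.0
        for t, flipped in zip(ids, flips):
            p *= error_prob[t] if flipped else 1.0 - error_prob[t]
        if evaluate(world) != labeled_output:
            total += p
    return total


# Hypothetical output tuple derived either by joining a and b, or from c alone.
prov = [["a", "b"], ["c"]]
labels = {"a": True, "b": True, "c": False}
errs = {"a": 0.1, "b": 0.1, "c": 0.1}
print(output_error_probability(prov, labels, errs))
```

The enumeration is exponential in the number of input tuples and is only meant to show how a few uncertain input labels combine through provenance into output uncertainty.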

Paper Structure (30 sections, 13 theorems, 15 equations, 9 figures, 9 tables, 3 algorithms)

This paper contains 30 sections, 13 theorems, 15 equations, 9 figures, 9 tables, 3 algorithms.

Introduction
Framework Components and Flow
Uncertainty Indicators
Contributions Overview and Novelty
Model
Provenance
Uncertainty Indicators
Maximal Error Score (MES) Metric
MES for a single output tuple
MES for multiple tuples
Generalizing beyond SPJU queries
Risky Tuples
Computing the Indicators
Computing MES for Incorrect Output Tuples
Computing MES for Correct Output Tuples
...and 15 more sections

Key Result

Lemma 1

Let $\hat{D} = \langle{\overline{D}, X, L}\rangle$ be an annotated DES, $Q$ be an SPJU query, and $o \in Q(D)$ be an output tuple with a known correctness label. Let $t_0$ be a risky input tuple. Then,

Figures (9)

Figure 1: The flow of our framework
Figure 2: Example query $Q_{\mathrm{ex}}$
Figure 3: $F_1$ metric as a function of the verification cost for selected algorithms on NELL's query Q4 in Scenario AVG
Figure 4: LLM prompt for the verification of a tuple (entity="E", relation="R", value="V"). Concrete values of E, R, V are substituted.
Figure 5: NELL query $Q_{3}$
...and 4 more figures

Theorems & Definitions (47)

Example 1
Definition 1
Example 2: DES
Example 3
Definition 2: Possible worlds and label errors
Example 4: Possible world
Definition 3: Labeling probability
Example 5
Definition 4: output correctness labeling
Example 6: Querying the DES
...and 37 more
