Relational Deep Dive: Error-Aware Queries Over Unstructured Data

Daren Chao, Kaiwen Chen, Naiqing Guan, Nick Koudas

TL;DR

Relational Deep Dive (ReDD) addresses the challenge of executing analytical queries over unstructured text by dynamically constructing a query-specific relational schema and populating it with corrected data. The framework combines Iterative Schema Discovery (ISD) to derive minimal joinable schemas with Tabular Data Population (TDP) that uses lightweight classifiers trained on LLM hidden states for error detection. Its core novelty lies in SCAPE, a statistically calibrated conformal-prediction-based method, and SCAPE-Hyb, which integrates a conflict signal to balance accuracy and human-correction cost while maintaining coverage guarantees. Across diverse datasets, ReDD reduces extraction errors from up to 30% to below 1% and achieves 100% schema recall with high precision, enabling robust, high-stakes analytical queries over unstructured corpora. The approach offers tunable accuracy-cost trade-offs, supports human-in-the-loop intervention, and scales to large document collections with provable guarantees.

Abstract

Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-Hyb, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.

Paper Structure

This paper contains 36 sections, 4 theorems, 44 equations, 8 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

Under the assumption that calibration and test examples are exchangeable, the conformal prediction set determined by SCAPE satisfies $\mathbb{P}\left( y_{k,\circ} \in \hat{y}^{\text{SCAPE}}_{k,\circ} \right) \ge 1 - \alpha$, where $y_{k,\circ}$ denotes the true label and $\hat{y}^{\text{SCAPE}}_{k,\circ} = \left\{ y \in \{0,1\} \mid \mathbf{s}(y) \in \mathcal{C}_\alpha \right\}$.
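
To give a feel for the kind of coverage guarantee Theorem 4.1 describes, the following Python sketch runs generic split conformal prediction on a binary error label. It is not SCAPE itself: the nonconformity score `1 - p(y)`, the single-threshold rule, the helper names, and the synthetic, perfectly calibrated toy classifier are all assumptions made for illustration.

```python
import math
import random


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank into the sorted scores
    if rank > n:
        return float("inf")  # too few calibration points: every label is kept
    return sorted(cal_scores)[rank - 1]


def prediction_set(p_error, q_hat):
    """Labels y in {0, 1} whose score 1 - p(y) does not exceed q_hat."""
    probs = {0: 1.0 - p_error, 1: p_error}
    return {y for y, p in probs.items() if 1.0 - p <= q_hat}


def simulate(n, rng):
    """Synthetic (p_error, label) pairs; the label is drawn with probability p_error."""
    data = []
    for _ in range(n):
        p = rng.random()
        y = 1 if rng.random() < p else 0
        data.append((p, y))
    return data


rng = random.Random(0)
alpha = 0.1
cal = simulate(2000, rng)   # hypothetical calibration split
test = simulate(2000, rng)  # hypothetical test split, exchangeable with cal

# Score of the true label on each calibration example.
cal_scores = [1.0 - (p if y == 1 else 1.0 - p) for p, y in cal]
q_hat = conformal_threshold(cal_scores, alpha)

# Fraction of test examples whose true label falls inside the prediction set.
covered = sum(y in prediction_set(p, q_hat) for p, y in test)
coverage = covered / len(test)
print(f"q_hat={q_hat:.3f}, empirical coverage={coverage:.3f}")
```

Because both splits come from the same distribution, the empirical coverage should land near the target $1 - \alpha$. A prediction set containing both labels marks an extraction the classifier cannot confidently accept or reject, which is the sort of case a human-correction step would plausibly target.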

Figures (8)

  • Figure 1: Overview of the query processing pipeline in ReDD. The left side of the dashed line shows the raw input, consisting of a natural language query and a collection of unstructured document chunks. The right side illustrates the core system workflow of ReDD, comprising: (A) schema discovery; (B) data population; and (C) SQL query generation and execution (not the focus of this work). Within the data population component (B), an error correction mechanism is integrated to automatically detect and rectify low-confidence extractions, enabling controllable accuracy.
  • Figure 2: Trade-off between data population accuracy ($\textit{ACC}_\textit{pop}$) and correction cost measured by the false positive rate ($\textit{FPR}_\textit{pop}$). For the SCAPE-Hyb curve, labels above each point indicate the corresponding $\alpha$ value used to produce that accuracy–cost trade-off.
  • Figure 3: Data population accuracy $\textit{ACC}_\textit{pop}$ for SCAPE-Hyb with different calibration dataset size $N_\text{cal-base}$ on dataset Spider, varying conflict weight $\lambda$, under $\textit{FPR}_\textit{pop}{=}0.2$.
  • Figure 4: Data population accuracy $\textit{ACC}_\textit{pop}$ of SCAPE and SCAPE-Hyb varying calibration dataset size $N_\text{cal-base}$, under $\textit{FPR}_\textit{pop}{=}0.2$.
  • Figure 5: Data population accuracy $\textit{ACC}_\textit{pop}$ varying training dataset size $N_\text{cls}$, under $\textit{FPR}_\textit{pop}{=}0.2$.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Theorem 4.1: Coverage Guarantee under Exchangeability
  • Proof
  • Theorem 4.2: Optimal Set Size of SCAPE
  • Proof
  • Theorem 4.3: Optimality of SCAPE-Hyb
  • Proof
  • Theorem 4.4: Optimality of SCAPE-Hyb over SCAPE
  • Proof