Relational Deep Dive: Error-Aware Queries Over Unstructured Data
Daren Chao, Kaiwen Chen, Naiqing Guan, Nick Koudas
TL;DR
Relational Deep Dive (ReDD) addresses the challenge of executing analytical queries over unstructured text by dynamically constructing a query-specific relational schema and populating it with extracted data whose errors are detected and corrected. The framework combines Iterative Schema Discovery (ISD), which derives minimal joinable schemas, with Tabular Data Population (TDP), which uses lightweight classifiers trained on LLM hidden states for error detection. Its core novelty lies in SCAPE, a statistically calibrated method based on conformal prediction, and SCAPE-HYB, which integrates a conflict signal to balance accuracy against human-correction cost while maintaining coverage guarantees. Across diverse datasets, ReDD reduces extraction errors from up to 30% to below 1% and achieves 100% schema recall with high precision, enabling robust, high-stakes analytical queries over unstructured corpora. The approach offers tunable accuracy-cost trade-offs, supports human-in-the-loop intervention, and scales to large document collections with provable guarantees.
Abstract
Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods such as RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. The main contributions of ReDD are SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.
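To make the idea of calibrated error detection with a coverage guarantee concrete, the following is a minimal sketch of generic split-conformal threshold calibration. It is not the SCAPE algorithm itself, whose details are not given here. It assumes each extraction already has an error score from some classifier (higher means more likely wrong) and that a held-out calibration set of extractions known to be erroneous is available; the function names `calibrate_threshold` and `flag_for_review` are hypothetical.

```python
import math


def calibrate_threshold(error_scores, alpha):
    """Pick a flagging threshold from calibration scores of known-erroneous extractions.

    Under the usual exchangeability assumption, a new erroneous extraction
    scores at or above the returned threshold with probability >= 1 - alpha.
    """
    n = len(error_scores)
    if n == 0:
        raise ValueError("calibration set must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    # Number of calibration errors that must sit at or above the threshold.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: flag everything.
        return float("-inf")
    ascending = sorted(error_scores)
    # The k-th largest calibration score.
    return ascending[n - k]


def flag_for_review(score, threshold):
    """Return True if an extraction should be sent for correction."""
    return score >= threshold


# Illustrative calibration scores (made up for this sketch, not from the paper).
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
threshold = calibrate_threshold(calibration, alpha=0.25)
print(threshold, flag_for_review(0.45, threshold), flag_for_review(0.1, threshold))
```

Lowering `alpha` pushes the threshold down, so more extractions are flagged and the human-correction cost rises; this is the kind of accuracy-cost knob the abstract describes, shown here only in its simplest form.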
