Exploring Multi-Table Retrieval Through Iterative Search
Allaa Boutaleb, Bernd Amann, Rafael Angarita, Hubert Naacke
TL;DR
Open-domain QA over datalakes requires retrieving and coherently composing information from multiple tables, which motivates an iterative, join-aware retrieval paradigm. The authors propose a general iterative search framework and a concrete Greedy Join-Aware Retrieval algorithm that balances relevance, coverage, and joinability via a marginal-gain utility. On five NL2SQL benchmarks, the iterative method achieves retrieval performance competitive with a Mixed-Integer Programming (MIP) baseline while running 4x–400x faster, depending on the benchmark and search-space settings. The work suggests that iterative heuristics can enable scalable, interpretable multi-table retrieval for datalake QA, and it opens avenues for extending the approach to unions and hybrid solver methods.
Abstract
Open-domain question answering over datalakes requires retrieving and composing information from multiple tables, a challenging subtask that demands both semantic relevance and structural coherence (e.g., joinability). While exact optimization methods like Mixed-Integer Programming (MIP) can ensure coherence, their computational complexity is often prohibitive. Conversely, simpler greedy heuristics that optimize for query coverage alone often fail to find coherent, joinable sets. This paper frames multi-table retrieval as an iterative search process, arguing that this approach offers advantages in scalability, interpretability, and flexibility. We propose a general framework and a concrete instantiation: a fast, effective Greedy Join-Aware Retrieval algorithm that holistically balances relevance, coverage, and joinability. Experiments across five NL2SQL benchmarks demonstrate that our iterative method achieves competitive retrieval performance compared to the MIP-based approach while being 4x–400x faster, depending on the benchmark and search-space settings. This work highlights the potential of iterative heuristics for practical, scalable, and composition-aware retrieval.
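To make the idea of iterative, join-aware selection concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a simple additive marginal-gain score (relevance plus newly covered query terms plus join links to already selected tables). The function name `greedy_join_aware`, the weights `w_cov` and `w_join`, and the toy inputs are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Set


def greedy_join_aware(
    relevance: Dict[str, float],
    covers: Dict[str, Set[str]],
    joinable: Set[FrozenSet[str]],
    k: int,
    w_cov: float = 1.0,
    w_join: float = 1.0,
) -> List[str]:
    """Pick up to k tables, one per iteration, by highest marginal gain.

    Assumed (hypothetical) gain of adding table t to the current selection:
        relevance[t]
        + w_cov  * (number of query terms t covers that are not yet covered)
        + w_join * (number of already selected tables t can join with)
    """
    selected: List[str] = []
    covered: Set[str] = set()
    candidates = set(relevance)

    while candidates and len(selected) < k:
        def gain(t: str) -> float:
            new_terms = len(covers.get(t, set()) - covered)
            joins = sum(1 for s in selected if frozenset((s, t)) in joinable)
            return relevance[t] + w_cov * new_terms + w_join * joins

        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        covered |= covers.get(best, set())
        candidates.remove(best)

    return selected


# Toy inputs (invented for illustration only).
relevance = {"orders": 0.9, "products": 0.7, "customers": 0.6}
covers = {
    "orders": {"order", "total"},
    "products": {"order"},
    "customers": {"customer"},
}
joinable = {frozenset(("orders", "customers"))}

join_aware = greedy_join_aware(relevance, covers, joinable, k=2)
relevance_only = greedy_join_aware(
    relevance, covers, joinable, k=2, w_cov=0.0, w_join=0.0
)
print(join_aware, relevance_only)
```

In this toy setting, the join-aware run selects `customers` as its second table because it adds a new query term and joins with `orders`, whereas the relevance-only run selects the more relevant but redundant and non-joinable `products`. Each step's score can be inspected directly, which is one way to read the interpretability argument made in the abstract.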
