A System and Benchmark for LLM-based Q&A on Heterogeneous Data
Achille Fokoue, Srideepika Jayaraman, Elham Khabiri, Jeffrey O. Kephart, Yingjie Li, Dhruv Shah, Youssef Drissi, Fenno F. Heath, Anu Bhamidipaty, Fateh A. Tipu, Robert J. Baseman
TL;DR
This work tackles natural language (NL) question answering over heterogeneous industrial data sources by introducing siwarex, a framework that unifies databases and APIs under a single relational schema in which APIs appear as virtual tables and are invoked via user-defined functions. It combines a ReAct-based NL-to-SQL pipeline, a Table Selector, a Query Rewriter, and guardrails to ensure correct routing and execution across APIs and databases. To evaluate performance under varying data heterogeneity, the authors extend the Spider benchmark by replacing a configurable fraction of DB tables with API proxies, creating a spectrum from pure DB access to pure API access. Experimental results show that siwarex maintains higher execution accuracy than a strong API+DB baseline as heterogeneity increases, which supports its practical viability for industry-grade NL Q&A over mixed data sources. The authors also commit to releasing the modified Spider benchmark to foster further research on LLM-based access to heterogeneous data.
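To make the idea of reaching an API through a user-defined function concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration only and not siwarex's implementation: the `city` table, the `api_population` function name, and the in-memory `_FAKE_API` dictionary standing in for a remote service are all hypothetical.

```python
import sqlite3

# Hypothetical stand-in for a remote data-retrieval API (no network call is made).
_FAKE_API = {"Paris": 2100000, "Lyon": 520000}


def fetch_population(city_name):
    """Pretend API call: return the population for a city, or None if unknown."""
    return _FAKE_API.get(city_name)


def build_connection():
    """Create an in-memory database with one ordinary table and one API-backed UDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE city (name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO city VALUES (?, ?)",
        [("Paris", "France"), ("Lyon", "France"), ("Rome", "Italy")],
    )
    # Register the API wrapper as a one-argument SQL function (assumed name).
    conn.create_function("api_population", 1, fetch_population)
    return conn


def cities_with_population(conn, country):
    """Run a single SQL query that mixes stored rows with API-provided values."""
    sql = (
        "SELECT name, api_population(name) "
        "FROM city WHERE country = ? ORDER BY name"
    )
    return conn.execute(sql, (country,)).fetchall()
```

Calling `cities_with_population(build_connection(), "France")` runs one SQL statement whose second column is filled in by the API wrapper, so a generated query never needs to know which values come from storage and which from an API.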
Abstract
In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user does not know how to identify or access the right data source. The problem is compounded when multiple (and potentially siloed) data sources must be combined to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables with data retrieval APIs. We find that siwarex copes well with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
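The benchmark construction described above, in which some Spider tables are replaced by data retrieval APIs, can be sketched as a simple partition of a schema's tables. This is a hedged illustration rather than the authors' procedure: the function name `split_tables`, the `api_fraction` parameter, the random selection, and the seeding are assumptions made for the example.

```python
import random


def split_tables(tables, api_fraction, seed=0):
    """Partition table names into (db_tables, api_tables).

    api_fraction = 0.0 keeps every table in the database;
    api_fraction = 1.0 exposes every table through an API proxy.
    """
    if not 0.0 <= api_fraction <= 1.0:
        raise ValueError("api_fraction must be between 0 and 1")
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = sorted(tables)   # sort first so input order does not matter
    rng.shuffle(shuffled)
    n_api = round(api_fraction * len(shuffled))
    api_tables = sorted(shuffled[:n_api])
    db_tables = sorted(shuffled[n_api:])
    return db_tables, api_tables
```

Sweeping `api_fraction` from 0.0 to 1.0 over the same set of tables yields the range of configurations from pure database access to pure API access against which a system can be scored.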
