DBRouting: Routing End User Queries to Databases for Answerability
Priyangshu Mandal, Manasi Patwardhan, Mayur Patidar, Lovekesh Vig
TL;DR
DBRouting tackles the problem of routing natural-language queries to the correct enterprise databases to enable accurate answer retrieval. The authors construct two benchmarks, Spider-Route and Bird-Route, by extending NL-to-SQL datasets (Spider and BirdSQL) to study end-user query routing across multiple DBs, including domain metadata and vertical clustering. They compare three approaches (Llama3 LM ranking, pre-trained embedding similarity, and fine-tuned task-specific SBERT embeddings) and find that task-specific embeddings generally outperform pre-trained baselines, while Llama3 performs strongly on certain Spider-Route splits but is limited by token length. The study shows that more data sources, higher domain similarity among sources, and a lack of external metadata all increase difficulty, while metadata augmentation helps (notably in Bird-Route). Overall, the work provides practical baselines, reveals key challenges, and motivates more sophisticated routing methods for large, heterogeneous enterprise data sources across domains.
Abstract
Enterprise-level data is often distributed across multiple sources, and identifying the correct set of data-sources with relevant information for a knowledge request is a fundamental challenge. In this work, we define the novel task of routing an end-user query to the appropriate data-source, where the data-sources are databases. We synthesize datasets by extending existing datasets designed for NL-to-SQL semantic parsing. We create baselines on these datasets using open-source LLMs, pre-trained embeddings, and task-specific embeddings fine-tuned on the training data. With these baselines we demonstrate that open-source LLMs perform better than embedding-based approaches, but suffer from token length limitations. Embedding-based approaches benefit from task-specific fine-tuning, more so when database-specific questions are available for training. We further find that the task becomes more difficult (i) with an increase in the number of data-sources, (ii) when data-sources are closer in terms of their domains, (iii) when databases lack the external domain knowledge required to interpret their entities, and (iv) with ambiguous and complex queries that require a more fine-grained understanding of the data-sources or logical reasoning to route to an appropriate source. This calls for developing more sophisticated solutions to better address the task.
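To make the embedding-based routing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each database is represented by a short text description of its schema, and it replaces the sentence encoder with a toy bag-of-words vector so the example runs with the standard library alone. The names `embed`, `cosine`, `route`, and the `SCHEMAS` contents are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(query, schemas, top_k=1):
    """Rank databases by similarity between the query and each schema text."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in schemas.items()]
    # Highest similarity first; ties are broken alphabetically by database name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical schema descriptions (illustrative, not taken from the benchmarks).
SCHEMAS = {
    "concert_singer": "stadium capacity singer name country age concert year",
    "pets": "student pet type weight age has pet",
    "flights": "airline airport flight number source destination city",
}

ranking = route(
    "Which singer performed in the concert with highest stadium capacity?",
    SCHEMAS,
    top_k=3,
)
print(ranking)
```

In a realistic setting, `embed` would be replaced by a pre-trained or fine-tuned sentence encoder, and the ranked list would be evaluated with retrieval-style metrics such as recall at k.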
