Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
TL;DR
This work addresses the lack of a comprehensive benchmark for evaluating LLM-based Database Question Answering (DBQA) systems and their modular components. It introduces DQABench, a bilingual (English and Chinese) dataset with over 200k QA pairs spanning general, product-specific, and instance-specific questions, and DQATestbed, a plug-and-play framework that combines pre-training, fine-tuning, Question Classification Routing (QCR), Prompt Template Engineering (PTE), Retrieval-Augmented Generation (RAG), and Tool Invocation Generation (TIG) for end-to-end DBQA evaluation. Through extensive experiments on nine LLMs (including Baichuan2 variants) and an analysis of the modular components, the paper shows that model size, domain-specific continual training, and the effectiveness of RAG and TIG each substantially shape DBQA performance, while retrieval recall and tool invocation capability remain key bottlenecks. The findings provide practical guidance for deploying LLMs in DBQA: when to rely on pre-training, how to route questions, and how to integrate external knowledge and tools to achieve robust, accurate, and scalable database question answering.
Abstract
The development of Large Language Models (LLMs) has revolutionized question answering (QA) across various industries, including the database domain. However, there is still no comprehensive benchmark for evaluating the capabilities of different LLMs and their modular components in database QA. To this end, we introduce DQABench, the first comprehensive database QA benchmark for LLMs. DQABench features an innovative LLM-based method that automates the generation, cleaning, and rewriting of the evaluation dataset, resulting in over 200,000 QA pairs in English and Chinese, separately. These QA pairs cover a wide range of database-related knowledge extracted from manuals, online communities, and database instances. Including these sources allows for an additional assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database QA task. Furthermore, we propose DQATestbed, a comprehensive LLM-based database QA testbed. The testbed is highly modular and scalable, with basic and advanced components such as Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). Moreover, DQABench provides a comprehensive evaluation pipeline that computes various metrics throughout a standardized evaluation process to ensure the accuracy and fairness of the evaluation. We use DQABench to comprehensively evaluate database QA capabilities under the proposed testbed. The evaluation reveals findings such as (i) the strengths and limitations of nine LLM-based QA bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). Our benchmark and findings will guide the future development of LLM-based database QA research.
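To make the modular design more concrete, the following is a minimal Python sketch of how a question router might dispatch questions to different answering components. It is an illustration only, not the DQATestbed implementation: the function names, the keyword lists, and the mapping from question category to component (general to direct answering, product-specific to RAG, instance-specific to TIG) are all assumptions made for this example, and the handlers return placeholder strings instead of calling a model, a retriever, or database tools.

```python
from typing import Callable, Dict, Tuple

# Question categories named in the text.
GENERAL = "general"
PRODUCT = "product-specific"
INSTANCE = "instance-specific"

# Hypothetical keyword lists, used only to keep the sketch self-contained.
INSTANCE_HINTS = ("my database", "my table", "this instance", "current workload")
PRODUCT_HINTS = ("postgresql", "mysql")


def classify_question(question: str) -> str:
    """Toy stand-in for Question Classification Routing (QCR).

    Assumption: keyword matching is for illustration only; a real router
    would more likely use a trained classifier or an LLM.
    """
    q = question.lower()
    if any(hint in q for hint in INSTANCE_HINTS):
        return INSTANCE
    if any(hint in q for hint in PRODUCT_HINTS):
        return PRODUCT
    return GENERAL


# Placeholder handlers: each one only labels the path a question would take.
def answer_directly(question: str) -> str:
    return f"[LLM only] {question}"


def answer_with_rag(question: str) -> str:
    return f"[RAG: retrieve documents, then answer] {question}"


def answer_with_tig(question: str) -> str:
    return f"[TIG: invoke database tools, then answer] {question}"


# Assumed mapping from category to component; adjust to the actual system.
ROUTES: Dict[str, Callable[[str], str]] = {
    GENERAL: answer_directly,
    PRODUCT: answer_with_rag,
    INSTANCE: answer_with_tig,
}


def route(question: str) -> Tuple[str, str]:
    """Classify a question and send it to the matching handler."""
    label = classify_question(question)
    return label, ROUTES[label](question)


if __name__ == "__main__":
    for q in (
        "What is a B-tree index?",
        "How do I configure autovacuum in PostgreSQL?",
        "Why is my table orders slow?",
    ):
        print(route(q))
```

Keeping the classifier and the handlers behind a plain dictionary makes each component swappable in isolation, which is the property a plug-and-play testbed needs in order to measure the contribution of one module at a time.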
