Table of Contents
Fetching ...

Automated Discovery of Test Oracles for Database Management Systems Using LLMs

Qiuyang Mang, Runyuan He, Suyang Zhong, Xiaoxuan Liu, Huanchen Zhang, Alvin Cheung

TL;DR

Argus tackles the long-standing bottleneck of manually crafting test oracles for DBMS testing by introducing Constrained Abstract Queries (CAQs) and a two-stage workflow that combines LLM-driven oracle discovery with formal SQL equivalence verification. By representing oracles as equivalent CAQ pairs and validating them with an extended SQL prover, Argus achieves soundness while scaling through a reusable corpus of SQL snippets used to instantiate thousands of concrete test cases. In extensive experiments across five widely used DBMSs, Argus discovered 40 new bugs (35 logic bugs; 26 fixed), and demonstrated improvements in code and metamorphic coverage relative to state-of-the-art baselines. The results show that automating test oracle generation can significantly enhance bug detection efficiency and database reliability, while highlighting the need to manage prover limitations and LLM costs for practical deployment.

Abstract

Since 2020, automated testing for Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely-used systems. A cornerstone of these techniques is test oracle, which typically implements a mechanism to generate equivalent query pairs, thereby identifying bugs by checking the consistency between their results. However, while applying these oracles can be automated, their design remains a fundamentally manual endeavor. This paper explores the use of large language models (LLMs) to automate the discovery and instantiation of test oracles, addressing a long-standing bottleneck towards fully automated DBMS testing. Although LLMs demonstrate impressive creativity, they are prone to hallucinations that can produce numerous false positive bug reports. Furthermore, their significant monetary cost and latency mean that LLM invocations should be limited to ensure that bug detection is efficient and economical. To this end, we introduce Argus, a novel framework built upon the core concept of the Constrained Abstract Query - a SQL skeleton containing placeholders and their associated instantiation conditions (e.g., requiring a placeholder to be filled by a boolean column). Argus uses LLMs to generate pairs of these skeletons that are asserted to be semantically equivalent. This equivalence is then formally proven using a SQL equivalence solver to ensure soundness. Finally, the placeholders within the verified skeletons are instantiated with concrete, reusable SQL snippets that are also synthesized by LLMs to efficiently produce complex test cases. We implemented Argus and evaluated it on five extensively tested DBMSs, discovering 40 previously unknown bugs, 35 of which are logic bugs, with 36 confirmed and 26 already fixed by the developers.

Automated Discovery of Test Oracles for Database Management Systems Using LLMs

TL;DR

Argus tackles the long-standing bottleneck of manually crafting test oracles for DBMS testing by introducing Constrained Abstract Queries (CAQs) and a two-stage workflow that combines LLM-driven oracle discovery with formal SQL equivalence verification. By representing oracles as equivalent CAQ pairs and validating them with an extended SQL prover, Argus achieves soundness while scaling through a reusable corpus of SQL snippets used to instantiate thousands of concrete test cases. In extensive experiments across five widely used DBMSs, Argus discovered 40 new bugs (35 logic bugs; 26 fixed), and demonstrated improvements in code and metamorphic coverage relative to state-of-the-art baselines. The results show that automating test oracle generation can significantly enhance bug detection efficiency and database reliability, while highlighting the need to manage prover limitations and LLM costs for practical deployment.

Abstract

Since 2020, automated testing for Database Management Systems (DBMSs) has flourished, uncovering hundreds of bugs in widely-used systems. A cornerstone of these techniques is test oracle, which typically implements a mechanism to generate equivalent query pairs, thereby identifying bugs by checking the consistency between their results. However, while applying these oracles can be automated, their design remains a fundamentally manual endeavor. This paper explores the use of large language models (LLMs) to automate the discovery and instantiation of test oracles, addressing a long-standing bottleneck towards fully automated DBMS testing. Although LLMs demonstrate impressive creativity, they are prone to hallucinations that can produce numerous false positive bug reports. Furthermore, their significant monetary cost and latency mean that LLM invocations should be limited to ensure that bug detection is efficient and economical. To this end, we introduce Argus, a novel framework built upon the core concept of the Constrained Abstract Query - a SQL skeleton containing placeholders and their associated instantiation conditions (e.g., requiring a placeholder to be filled by a boolean column). Argus uses LLMs to generate pairs of these skeletons that are asserted to be semantically equivalent. This equivalence is then formally proven using a SQL equivalence solver to ensure soundness. Finally, the placeholders within the verified skeletons are instantiated with concrete, reusable SQL snippets that are also synthesized by LLMs to efficiently produce complex test cases. We implemented Argus and evaluated it on five extensively tested DBMSs, discovering 40 previously unknown bugs, 35 of which are logic bugs, with 36 confirmed and 26 already fixed by the developers.

Paper Structure

This paper contains 46 sections, 1 theorem, 2 equations, 5 figures, 3 tables, 3 algorithms.

Key Result

theorem 1

Consider two SQL queries, $Q_1$ and $Q_2$, that are semantically equivalent on a schema $S$ containing virtual columns $\{v_i\}$. Let $Q_1'$ and $Q_2'$ be the queries created by replacing every occurrence of each virtual column $v_i$ with its corresponding expression $e_i$. If each expression $e_i$

Figures (5)

  • Figure 1: Overall pipeline of Argus.
  • Figure 2: BNF Grammar for CAQ Pairs, where we omit the full SQL grammar for simplicity.
  • Figure 3: Code coverage achieved by Argus, SQLancer, and SQLancer++ on DuckDB and PostgreSQL over 24-hour runs.
  • Figure 4: Number of unique logic bugs found by different sets of oracles within a 6-hour testing on Dolt.
  • Figure 5: Cost and efficiency comparison between Argus and the naive baseline on Dolt over a 1-hour run.

Theorems & Definitions (3)

  • Definition 4.1: CAQ Instantiation
  • Definition 4.2: Equivalent CAQ Pair
  • theorem 1: Expression Instantiation Preserves Equivalence