BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan, Yannis Papakonstantinou, Reynold Cheng

TL;DR

Bird-Interact introduces a dynamic, multi-turn benchmark for interactive text-to-SQL that couples a database with a hierarchical knowledge base and a function-driven user simulator. It provides two evaluation settings, $c$-Interact and $a$-Interact, and a large task suite (Bird-Interact-Full and Bird-Interact-Lite) that spans the full CRUD spectrum with ambiguity and follow-up sub-tasks. Empirical results show that state-of-the-art models struggle under realistic interaction demands: GPT-5 achieves only $8.67\%$ success in $c$-Interact and $17\%$ in $a$-Interact, underscoring the importance of effective communication and strategic exploration. The work also introduces robust simulator design (memory grafting, UserSim-Guard) and a budget-aware evaluation framework to stress-test interactive capabilities, with practical implications for deploying DB assistants in production contexts.

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
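The completion figures quoted above are fractions of tasks solved. As a minimal sketch of that arithmetic, assuming each task yields a single pass/fail outcome from its executable test cases, a success rate could be computed as follows. The function name `success_rate` and the task counts are hypothetical illustrations, not values or code taken from the paper.

```python
def success_rate(outcomes):
    """Return the percentage of tasks marked as passed.

    `outcomes` is a sequence of booleans, one per task (an assumed,
    simplified representation of per-task test-case results).
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    passed = sum(1 for ok in outcomes if ok)
    return 100.0 * passed / len(outcomes)


# Illustrative counts only (not reported in the source):
# 52 passing tasks out of a hypothetical 600-task run.
outcomes = [True] * 52 + [False] * 548
print(f"{success_rate(outcomes):.2f}%")
```

Under this reading, a single-digit success rate on a suite of several hundred tasks corresponds to only a few dozen fully solved tasks.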

Paper Structure

This paper contains 105 sections, 6 equations, 27 figures, 11 tables.

Figures (27)

  • Figure 1: Task overview of Bird-Interact showing the evaluated system interacting with DB Environment and User Simulator to complete the user task with a sequence of sub-tasks.
  • Figure 2: Knowledge chain breaking ambiguity.
  • Figure 3: Two evaluation settings for Bird-Interact: $c$-Interact, where the system engages in conversation with the user, and $a$-Interact, where the system interacts flexibly. At the end of the task, the system will receive a reward $r\in[0, 1]$.
  • Figure 4: Performance of different LLMs under varying levels of user patience on Bird-Interact-Lite. The red line denotes $a$-Interact mode (-a); the blue line denotes $c$-Interact mode (-c). The dotted line (Idealized Performance) denotes performance under ambiguity-free single-turn text-to-SQL.
  • Figure 5: SR of GPT-5 with memory grafting.
  • ...and 22 more figures