Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques

Guifeng Wang, Yuanfeng Song, Meng Yang, Tao Zhu, Xiaoming Yin, Xing Chen

TL;DR

This work tackles evaluation and reward bottlenecks in text-to-SQL by introducing RuCo-C, a rubric-integrated generative judge that produces query-specific, interpretable critiques without requiring gold SQL. It combines rubric-based critique response generation with a two-phase training regime: rubric-aligned supervised fine-tuning and reinforcement learning using Group Relative Policy Optimization. A novel reward design fuses outcome and process signals, including a verification mechanism to handle noisy data, yielding denser feedback and more stable RL. Empirical results on Spider, BIRD, and Spider-DK demonstrate substantial improvements over baselines, highlighting RuCo-C’s potential to democratize scalable, fine-grained evaluation and guide RL-based text-to-SQL training.

Abstract

Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, the evaluation and reward mechanisms used to train and assess text-to-SQL models remain a critical bottleneck. Current approaches rely heavily on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL use only the final binary execution outcome as the reward signal, a coarse-grained form of supervision that overlooks the detailed structural and semantic errors a rubric-based assessment would expose. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a "progressive exploration" strategy during RL training, which dynamically adjusts the rewards to enhance the model's performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.
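To make the contrast between a binary outcome reward and a denser, rubric-informed reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical linear blend with weight `alpha` and hypothetical per-rubric scores in [0, 1]; the paper's actual reward function and verification mechanism are not reproduced here. The advantage computation follows the usual GRPO idea of standardizing rewards within a group of sampled responses; using the population standard deviation with a small `eps` is an implementation assumption.

```python
from statistics import mean, pstdev


def fused_reward(outcome: float, rubric_scores: list[float], alpha: float = 0.5) -> float:
    """Blend an outcome signal with the mean of per-rubric scores.

    `alpha` and the linear blend are illustrative assumptions, not the paper's formula.
    """
    process = mean(rubric_scores) if rubric_scores else 0.0
    return alpha * outcome + (1.0 - alpha) * process


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one sampled group (GRPO-style advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled critiques for the same input:
# (binary outcome, hypothetical per-rubric scores).
group = [
    (1.0, [1.0, 1.0, 1.0]),
    (1.0, [1.0, 0.0, 1.0]),
    (0.0, [1.0, 1.0, 0.0]),
    (0.0, [0.0, 0.0, 0.0]),
]

rewards = [fused_reward(outcome, scores) for outcome, scores in group]
advantages = group_relative_advantages(rewards)

# With the binary outcome alone, samples 1 and 2 (and 3 and 4) are indistinguishable.
binary_advantages = group_relative_advantages([outcome for outcome, _ in group])

print(rewards)
print(advantages)
print(binary_advantages)
```

In this toy group the fused reward separates all four samples, whereas the binary outcome yields only two distinct advantage values, which is the sense in which the process signal makes the feedback denser.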

Paper Structure

This paper contains 30 sections, 6 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example critique response illustrating the designed output format, which is used to compute the reward score during RL training of the critique model.
  • Figure 2: The working pipeline of our solution. It first automatically generates rubric-integrated critique responses via a multi-agent framework. Then, it integrates densified reward feedback through a consistency mechanism combined with a weighting strategy during RL training, dynamically adjusting rewards to enhance model performance.
  • Figure 3: Box and kernel density plots comparing the distributions of ground-truth labels and the reward function under EX (left) and our RuCo-C method (right).
  • Figure 4: A false-positive example of EX scoring (the EX score is 1 while the true label is 0), along with the critique response generated by RuCo-C, which indicates the correct judgment.
  • Figure 5: Reward score trends on the training set (left) and validation set (right) over GRPO training steps. The red curve corresponds to the baseline model trained with the EX reward, while the green curve corresponds to our RuCo-C model trained with the designed reward function. Both models are trained with the Qwen2.5-Coder-7B-Instruct backbone under the GRPO framework. Despite starting from a relatively low initial score, our RuCo-C model exhibits a distinct upward trend, indicating that the feedback accumulated over training steps substantially improves the model.
  • ...and 4 more figures
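
The kind of false positive described in the Figure 4 caption can be illustrated with a small, self-contained sketch. The schema, data, and queries below are hypothetical and are not taken from the paper; the example assumes EX is computed by comparing the executed result sets of the predicted and gold queries (here as order-insensitive multisets, which is one common choice).

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries yield the same rows, ignoring row order."""
    predicted_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows) == sorted(gold_rows)


# Hypothetical database for the question "List students older than 20".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Ann", 19), ("Bob", 22), ("Cy", 25)],
)

gold_sql = "SELECT name FROM student WHERE age > 20"
predicted_sql = "SELECT name FROM student WHERE age >= 20"  # off-by-one boundary error

# No student is exactly 20, so the two result sets coincide and EX would score 1.
match_before = execution_match(conn, predicted_sql, gold_sql)

# Adding a student aged exactly 20 exposes the semantic difference.
conn.execute("INSERT INTO student VALUES (?, ?)", ("Dee", 20))
match_after = execution_match(conn, predicted_sql, gold_sql)

print(match_before, match_after)
```

The predicted query is wrong for the question regardless of the data, but an execution-based check only detects this when the database happens to contain a distinguishing row; a critique that inspects the comparison operator directly does not depend on that coincidence.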