Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL

Hanbing Liu, Haoyang Li, Xiaokang Zhang, Ruotong Chen, Haiyong Xu, Tian Tian, Qi Qi, Jing Zhang

TL;DR

This work shows that Direct Preference Optimization (DPO) alone often fails to improve Text-to-SQL models because standard datasets lack chain-of-thought (CoT) solutions. By generating synthetic CoT solutions and training with a three-stage pipeline (CoT synthesis, supervised fine-tuning, then DPO), the authors achieve consistent, significant gains across open-source models and benchmarks, including Spider and Bird. They demonstrate that CoT mitigates reward hacking, strengthens the model's discriminative capability, and improves training and inference scalability, providing a practical path to robust Text-to-SQL systems. The findings underscore the role of data quality and traceable reasoning in preference-based learning for complex reasoning tasks. The authors release their code and CoT-augmented datasets to support further research.

Abstract

Direct Preference Optimization (DPO) has proven effective in complex reasoning tasks like math word problems and code generation. However, when applied to Text-to-SQL datasets, it often fails to improve performance and can even degrade it. Our investigation reveals the root cause: unlike math and code tasks, which naturally integrate Chain-of-Thought (CoT) reasoning with DPO, Text-to-SQL datasets typically include only final answers (gold SQL queries) without detailed CoT solutions. By augmenting Text-to-SQL datasets with synthetic CoT solutions, we achieve, for the first time, consistent and significant performance improvements using DPO. Our analysis shows that CoT reasoning is crucial for unlocking DPO's potential, as it mitigates reward hacking, strengthens discriminative capabilities, and improves scalability. These findings offer valuable insights for building more robust Text-to-SQL models. To support further research, we publicly release the code and CoT-enhanced datasets.
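
For reference, the standard DPO objective is reproduced below, since this summary does not restate it. It is the general formulation from the original DPO work, not an equation taken from this paper. The notation is an assumption: $x$ is the prompt (question plus schema), $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) model, and $\beta$ is the usual temperature-like coefficient.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right],
\qquad
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

Here $\hat{r}_\theta$ is the implicit reward that the policy assigns to a response. In the setting the abstract describes, $y_w$ and $y_l$ would be bare SQL queries without CoT augmentation, and full CoT solutions ending in a SQL query with it.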

Paper Structure

This paper contains 55 sections, 6 equations, 16 figures, 33 tables.

Figures (16)

  • Figure 1: Model performance gains (greedy decoding) achieved by DPO over SFT (Improved Execution Accuracy, %). Chain-of-thought reasoning is crucial for unlocking DPO's potential, ensuring its effectiveness and stability.
  • Figure 2: Overview of the proposed pipeline.
  • Figure 3: Comparison of the model's discriminative ability during DPO (measured by classification accuracy on a curated evaluation set).
  • Figure 4: Comparison of the model's self-assessed performance (the average implicit reward the policy model assigns to its own roll-outs) and its real performance (EX) on the Bird development set (Pass@1) during DPO training.
  • Figure 5: Model performance with different sample budget $K$ in each stage (Maj@K). Qwen2.5-7B-Instruct is used as the base model.
  • ...and 11 more figures
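
The Maj@K metric named in the Figure 5 caption can be illustrated with a minimal sketch. This summary does not describe the paper's exact voting procedure, so the sketch rests on assumptions: the K sampled SQL queries are grouped by their execution result on a SQLite database, and one query from the largest group is returned. The function names `execute` and `maj_at_k` are hypothetical.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query and return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Treat the result as a multiset of rows so row order does not affect the vote.
    return frozenset(Counter(rows).items())


def maj_at_k(conn, candidates):
    """Return one candidate from the largest group of queries with equal results.

    Assumption: queries that fail to execute do not vote.
    """
    results = [execute(conn, sql) for sql in candidates]
    votes = Counter(r for r in results if r is not None)
    if not votes:
        return None
    winner, _ = votes.most_common(1)[0]
    # Return the first sampled query that produced the winning result.
    return candidates[results.index(winner)]
```

With this setup, Pass@1 corresponds to evaluating a single sampled query, while Maj@K spends K samples per question and lets agreement among execution results choose the answer.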