Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL

Hanbing Liu; Haoyang Li; Xiaokang Zhang; Ruotong Chen; Haiyong Xu; Tian Tian; Qi Qi; Jing Zhang

TL;DR

This work shows that Direct Preference Optimization (DPO) alone often fails to improve Text-to-SQL models because standard datasets lack chain-of-thought (CoT) data. By generating synthetic CoT solutions and training with a three-stage pipeline (CoT synthesis, supervised fine-tuning, then DPO), the authors achieve consistent, significant gains across open-source models and benchmarks, including Spider and BIRD. They demonstrate that CoT mitigates reward hacking, strengthens discriminative capability, and improves training and inference scalability, providing a practical path to robust Text-to-SQL systems. The findings underscore the role of data quality and reasoning traceability in preference-based learning for complex reasoning tasks. The authors release their code and CoT-augmented datasets to support further research.

Abstract

Direct Preference Optimization (DPO) has proven effective in complex reasoning tasks like math word problems and code generation. However, when applied to Text-to-SQL datasets, it often fails to improve performance and can even degrade it. Our investigation reveals the root cause: unlike math and code tasks, which naturally integrate Chain-of-Thought (CoT) reasoning with DPO, Text-to-SQL datasets typically include only final answers (gold SQL queries) without detailed CoT solutions. By augmenting Text-to-SQL datasets with synthetic CoT solutions, we achieve, for the first time, consistent and significant performance improvements using DPO. Our analysis shows that CoT reasoning is crucial for unlocking DPO's potential, as it mitigates reward hacking, strengthens discriminative capabilities, and improves scalability. These findings offer valuable insights for building more robust Text-to-SQL models. To support further research, we publicly release the code and CoT-enhanced datasets.
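For readers unfamiliar with the objective, the following is a minimal sketch of the standard per-pair DPO loss, not the authors' released implementation. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values in the usage lines are illustrative assumptions. In the CoT setting described above, each log-probability would presumably be summed over the tokens of the full response (reasoning steps plus final SQL) rather than the SQL alone.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response under
    the trainable policy or the frozen reference model. beta=0.1 is an
    assumed illustrative value, not one taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)    # policy equals reference
preferred = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy favors the chosen response
```

When the policy matches the reference, the margin is zero and the loss equals log 2. Raising the chosen response's likelihood relative to the rejected one lowers the loss.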
