Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang
TL;DR
This work tackles the challenge of jointly improving natural language CoT and program CoT for mathematical reasoning. It introduces Parrot, a three-subtask training pipeline (Information Retrieval, P-CoT Reasoning, Paradigm Conversion) with a hybrid SFT strategy and reinforcement learning, including an auxiliary reward from converted N-CoT to address sparse rewards. Through extensive experiments on SVAMP, GSM8K, and MathQA across multiple model families, Parrot yields substantial gains in N-CoT performance while maintaining strong P-CoT results, with notable improvements on MathQA. Ablation analyses validate the contribution of each subtask and the RL components, and the approach demonstrates good data efficiency and applicability to different model families, indicating broad potential for cross-paradigm enhancement in mathematical reasoning.
Abstract
Natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT) have emerged as the two primary paradigms by which large language models (LLMs) solve mathematical reasoning problems. Current research typically pursues unidirectional enhancement: using P-CoT to enhance N-CoT, or N-CoT to enhance P-CoT. In this paper, we seek to exploit the strengths of both paradigms for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across the two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) three targeted subtasks that integrate sequential P-CoT and N-CoT generation; 2) a subtask hybrid training strategy that facilitates natural language semantic transferability; 3) an auxiliary reward derived from the converted N-CoT, designed to alleviate sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both N-CoT and P-CoT performance, especially N-CoT. With Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA improves by +21.87 and +21.48, respectively, on MathQA over the resource-intensive RL baseline.
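To make the idea of an auxiliary reward concrete, the following is a minimal sketch of how a sparse P-CoT correctness reward might be supplemented by a signal from the converted N-CoT. It is an illustration only, not the reward used in the paper: the function names `answers_match` and `combined_reward`, the additive form, the 0/1 base reward, and the `aux_weight` value are all assumptions made here.

```python
from typing import Optional


def answers_match(predicted: Optional[float], gold: float, tol: float = 1e-4) -> bool:
    """Return True when a predicted numeric answer agrees with the gold answer.

    A value of None stands for a rollout that produced no usable answer
    (for example, a program that failed to execute). Hypothetical helper.
    """
    if predicted is None:
        return False
    return abs(predicted - gold) <= tol


def combined_reward(
    pcot_answer: Optional[float],
    ncot_answer: Optional[float],
    gold: float,
    aux_weight: float = 0.5,  # assumed weight, not a value from the paper
) -> float:
    """Sparse P-CoT reward plus an auxiliary term from the converted N-CoT.

    Assumption: the base reward is 1 for a correct program answer and 0
    otherwise, and the auxiliary term adds `aux_weight` when the N-CoT
    converted from that program also reaches the gold answer.
    """
    base = 1.0 if answers_match(pcot_answer, gold) else 0.0
    aux = aux_weight if answers_match(ncot_answer, gold) else 0.0
    return base + aux


if __name__ == "__main__":
    # Program failed to run, but the converted N-CoT is correct:
    # the rollout still receives a non-zero learning signal.
    print(combined_reward(None, 5.0, gold=5.0))
    # Both paradigms reach the gold answer.
    print(combined_reward(5.0, 5.0, gold=5.0))
```

Under these assumptions, a rollout whose program fails can still receive partial credit through its converted N-CoT, which is one way a denser signal could be obtained.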
