Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work tackles the challenge of jointly improving natural language CoT and program CoT for mathematical reasoning. It introduces Parrot, a three-subtask training pipeline (Information Retrieval, P-CoT Reasoning, Paradigm Conversion) with a hybrid SFT strategy and reinforcement learning, including an auxiliary reward from converted N-CoT to address sparse rewards. Through extensive experiments on SVAMP, GSM8K, and MathQA across multiple model families, Parrot yields substantial gains in N-CoT performance while maintaining strong P-CoT results, with notable improvements on MathQA. Ablation analyses validate the contribution of each subtask and the RL components, and the approach demonstrates good data efficiency and applicability to different model families, indicating broad potential for cross-paradigm enhancement in mathematical reasoning.

Abstract

Natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT) have emerged as the two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically pursues unidirectional enhancement: P-CoT-enhanced N-CoT or N-CoT-enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across the two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy facilitates natural language semantic transferability. 3) A converted N-CoT auxiliary reward alleviates the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances the performance of both N-CoT and P-CoT, especially N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA achieves gains of +21.87 and +21.48 on MathQA over the resource-intensive RL baseline.
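
To make the third point concrete, the following is a minimal sketch of how an auxiliary reward from the converted N-CoT could densify an otherwise binary P-CoT reward. It is an illustration under stated assumptions, not the reward defined in the paper: the function names, the additive form, and the constants `CORRECT_REWARD`, `AUX_WEIGHT`, and `TOLERANCE` are all hypothetical.

```python
from typing import Optional

# Hypothetical constants chosen for illustration; the paper's values are not given here.
CORRECT_REWARD = 1.0   # reward when the executed P-CoT program yields the gold answer
AUX_WEIGHT = 0.5       # extra credit when the converted N-CoT also reaches the gold answer
TOLERANCE = 1e-4       # numeric tolerance when comparing answers


def is_correct(predicted: Optional[float], gold: float, tol: float = TOLERANCE) -> bool:
    """Return True if a parsed answer exists and matches the gold answer within tol."""
    return predicted is not None and abs(predicted - gold) <= tol


def shaped_reward(
    pcot_answer: Optional[float],
    ncot_answer: Optional[float],
    gold: float,
    aux_weight: float = AUX_WEIGHT,
) -> float:
    """Combine the sparse P-CoT outcome reward with an auxiliary N-CoT term.

    pcot_answer: result of executing the generated program (None if it failed).
    ncot_answer: final answer parsed from the converted N-CoT (None if unparsable).
    """
    base = CORRECT_REWARD if is_correct(pcot_answer, gold) else 0.0
    aux = aux_weight if is_correct(ncot_answer, gold) else 0.0
    return base + aux
```

With these assumed constants, a rollout can receive one of four reward levels (0.0, 0.5, 1.0, or 1.5) rather than two, which is the sense in which an auxiliary signal makes the reward less sparse.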

Paper Structure

This paper contains 54 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The histogram of error types. The labels on the x-axis are defined in Section \ref{Define}, while OE denotes Other Errors. Results from SFT are shaded in light colors, and Parrot SFT results are presented in dark colors.
  • Figure 2: The training pipeline and methods of Parrot. On the left, the pipeline consists of three subtasks: Information Retrieval, P-CoT Reasoning, and Paradigm Conversion. Through these subtasks, the model sequentially generates P-CoT and N-CoT. On the right, we use a Hybrid Supervised Fine-Tuning (SFT) strategy to enable semantic transfer and incorporate reinforcement learning algorithms for further improvements. The detailed Parrot inference process and subtask prompts are provided in Appendix \ref{sub prom}.
  • Figure 3: The training accuracy of LLaMA-2-7B on SVAMP for P-CoT RL. The left figure is without the converted penalty while the right is with the penalty.
  • Figure 4: The results of performing SFT using the original N-CoT and the converted N-CoT data. On the left, SFT represents the original data size, while Equ. SFT refers to randomly expanding the data to match the scale of the On-SL collections. Parrot denotes the collected data. We collect the correct N-CoT data from three epochs of Parrot On-SL and perform supervised training after deduplication.
  • Figure 5: The examples and analysis of Comprehension Error (CE), Calculation Error (CA) and Logic Inconsistency (LI) in N-CoT SFT model reasoning outputs.
  • ...and 2 more figures