Design of Chain-of-Thought in Math Problem Solving

Zhanming Jie, Trung Quoc Luong, Xinbo Zhang, Xiaoran Jin, Hang Li

TL;DR

The paper compares natural-language (NL) and program-based chain-of-thought (CoT) designs for math problem solving, using three datasets (GSM8K, MATHQA, and SVAMP), three program CoT variants (self-describing, comment-describing, and non-describing), and two programming languages (Python and Wolfram Language). It shows that program CoTs, particularly the self-describing program (SDP), generally outperform NL CoTs, and that 30B-scale models amplify these gains when supervised fine-tuning is combined with reward-model reranking and majority voting. Python-based program CoTs generally outperform their Wolfram counterparts, the diversity of SDP helps under voting and reranking, and executing programs offers a form of verification that NL CoTs lack. The work offers practical guidelines for CoT design, highlighting the value of ensembling CoT types and of cross-language experimentation; the datasets and code are publicly available.
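
As a rough illustration of why program CoTs combine well with majority voting, the sketch below executes several sampled programs and votes over the answers they produce. It is a minimal sketch under stated assumptions, not the authors' pipeline: the helper names `run_program` and `majority_vote`, the convention that each program stores its result in a variable called `answer`, and the sample programs are all hypothetical. Plain `exec` is not sandboxed, so it should not be used on untrusted model output as written.

```python
from collections import Counter


def run_program(source):
    """Execute one candidate program; return its `answer` value or None on failure."""
    namespace = {}
    try:
        exec(source, namespace)  # assumption: not sandboxed, illustration only
    except Exception:
        return None  # programs that fail to run contribute no vote
    return namespace.get("answer")


def majority_vote(programs):
    """Return the most frequent answer among the programs that executed."""
    answers = [run_program(p) for p in programs]
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical samples for "3 boxes of 12 pencils plus 5 loose pencils".
samples = [
    "answer = 3 * 12 + 5",
    "boxes = 3\nanswer = boxes * 12 + 5",
    "answer = 3 * (12 + 5)",  # wrong reasoning, still executes
    "answer = 3 * 12 +",      # syntax error, filtered out
]
winner = majority_vote(samples)
```

Execution filters out the malformed sample before voting, which is the kind of check that an NL CoT cannot receive without a separate verifier.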

Abstract

Chain-of-Thought (CoT) plays a crucial role in reasoning for math problem solving. We conduct a comprehensive examination of methods for designing CoT, comparing conventional natural language CoT with various program CoTs, including the self-describing program, the comment-describing program, and the non-describing program. Furthermore, we investigate the impact of programming language on program CoTs, comparing Python and Wolfram Language. Through extensive experiments on GSM8K, MATHQA, and SVAMP, we find that program CoTs often have superior effectiveness in math problem solving. Notably, the best-performing combination with 30B parameters beats GPT-3.5-turbo by a significant margin. The results show that the self-describing program offers greater diversity and thus can generally achieve higher performance. We also find that Python is a better choice of language than Wolfram Language for program CoTs. The experimental results provide a valuable guideline for future CoT designs that take into account both programming language and coding style for further advancements. Our datasets and code are publicly available.
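
To make the three program CoT styles concrete, the sketch below writes one solution in each style. It is illustrative only: the word problem, the function names, and the variable names are made up for this example and are not taken from the paper or its datasets. The reading of each style (descriptive variable names for the self-describing program, generic names plus comments for the comment-describing program, generic names with no comments for the non-describing program) follows the style names and should be treated as an assumption about the details.

```python
# Hypothetical problem: "A shop sells 3 boxes of 12 pencils each
# and 5 loose pencils. How many pencils does it sell?"


def self_describing_program():
    # SDP style: the variable names themselves describe each quantity.
    boxes = 3
    pencils_per_box = 12
    loose_pencils = 5
    pencils_in_boxes = boxes * pencils_per_box
    total_pencils = pencils_in_boxes + loose_pencils
    return total_pencils


def comment_describing_program():
    # CDP style: generic variable names, with a comment describing each step.
    v1 = 3          # number of boxes
    v2 = 12         # pencils in each box
    v3 = 5          # loose pencils
    v4 = v1 * v2    # pencils sold in boxes
    v5 = v4 + v3    # total pencils sold
    return v5


def non_describing_program():
    v1 = 3
    v2 = 12
    v3 = 5
    v4 = v1 * v2
    v5 = v4 + v3
    return v5


results = [
    self_describing_program(),
    comment_describing_program(),
    non_describing_program(),
]
```

All three programs compute the same answer; they differ only in how much natural-language description accompanies the computation.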

Paper Structure (29 sections, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Examples of CoT representations: Natural Language (NL) CoT, Comment-Describing Program (CDP) and Self-Describing Program (SDP) in both Wolfram and Python.
  • Figure 2: Overview of data collection, with CDP as an example.
  • Figure 3: Majority voting performance for different numbers of sampled instances (Left: $6.7$B; Right: $30$B). Only Python results are shown, for illustration.
  • Figure 4: The percentage of failure cases that are correctly predicted in different CoT types.