Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning

Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, Minlie Huang

TL;DR

This work introduces CodePlan, a scalable framework that empowers LLMs to generate and follow code-form plans: pseudocode that outlines high-level, structured reasoning processes. Because such plans can be extracted automatically from existing text corpora, the framework scales up efficiently and improves LLMs' reasoning capabilities across diverse scenarios.

Abstract

Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly rely on prompting or task-specific fine-tuning, and often suffer from poor robustness and weak cross-task generalization. To address this limitation, we introduce CodePlan, a scalable framework that empowers LLMs to generate and follow code-form plans: pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. Importantly, CodePlan allows automatic extraction of code-form plans from massive, wide-ranging text corpora without the need for curated, task-specific datasets. This enables it to scale up efficiently and improve LLMs' reasoning capabilities across diverse scenarios. To train CodePlan, we construct a large-scale dataset of 2M examples that integrate code-form plans with standard prompt-response pairs from existing corpora. With minimal computational overhead during both training and inference, CodePlan achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi-step reasoning benchmarks spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop QA, and decision-making tasks. Further analysis reveals that CodePlan's performance gains increase on more complex reasoning tasks, and that it is significantly data-efficient thanks to its generalization ability.
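
To make the idea of a code-form plan concrete, the following is a minimal sketch of what such a plan might look like and how it could be combined with a prompt-response pair into one training example. Everything here is an illustrative assumption and not the authors' actual data format: the `PlanAugmentedExample` class, the `### Prompt` / `### Plan` / `### Response` markers, and the `lookup` helper inside the plan are all hypothetical. The plan is pseudocode written in Python syntax; it is meant to guide the response, not to be executed.

```python
from dataclasses import dataclass

# Hypothetical code-form plan for a two-hop question. It is stored as text:
# the plan outlines the reasoning steps (including a conditional branch) and
# is never executed. `lookup` is an illustrative placeholder, not a real API.
EXAMPLE_PLAN = '''\
def answer(question):
    # Hop 1: resolve the intermediate entity mentioned in the question
    director = lookup("director of the film named in the question")
    # Hop 2: use that entity to retrieve the final fact
    birthplace = lookup(f"birthplace of {director}")
    if birthplace is None:
        return "unknown"
    return birthplace
'''


@dataclass
class PlanAugmentedExample:
    """A prompt-response pair with a code-form plan placed in between."""

    prompt: str
    plan: str
    response: str

    def to_training_text(self) -> str:
        # Assumed serialization order: prompt, then plan, then response,
        # so the model learns to plan before it answers.
        return (
            f"### Prompt\n{self.prompt}\n\n"
            f"### Plan\n{self.plan}\n"
            f"### Response\n{self.response}\n"
        )


example = PlanAugmentedExample(
    prompt="<a multi-hop question>",
    plan=EXAMPLE_PLAN,
    response="<a response that follows the plan step by step>",
)
training_text = example.to_training_text()
```

Placing the plan between the prompt and the response reflects the abstract's description of generating a plan and then following it; the exact delimiters a real implementation uses may differ.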

Paper Structure (35 sections, 4 equations, 6 figures, 19 tables)

Figures (6)

  • Figure 1: Two examples for the mathematical reasoning task (Top) and instruction-following task (Bottom) with Mistral-7B as the base model. Words highlighted in red: Unreasonable reasoning steps; Maroon words: Conditional branches in the plan; Blue words: Iterative loops in the plan; Purple boxes: Function making in the plan; Golden underlined words: Function calling in the plan; Words highlighted in green: Essential reasoning steps in the response adhering to the plan.
  • Figure 2: EM (Left) and F1 (Right) scores on the MuSiQue benchmark. $N$-hop means that the question requires $N$ reasoning steps to answer based on knowledge in Wikipedia passages.
  • Figure 3: Performance trajectories on two downstream tasks via vanilla training and CodePlan. "4-Hop" denotes evaluating on the 4-hop subset.
  • Figure 4: Comparing natural language planning with CodePlan. The scores of each type of task are averaged across all corresponding benchmarks.
  • Figure 5: Comparing CodeReason (i.e., executable code-form reasoning) with CodePlan.
  • ...and 1 more figure
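
Figure 2 reports EM and F1 scores on MuSiQue. As a reference for how these two answer-level metrics are commonly computed, here is a minimal SQuAD-style sketch. The normalization rules (lowercasing, stripping punctuation and English articles) are an assumption based on common QA evaluation practice; the paper's exact evaluation script may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only answers that match after normalization, while F1 gives partial credit for overlapping tokens, which is why the two panels of such a figure can show different magnitudes for the same model.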