Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu

TL;DR

Caco is a scalable, code-assisted framework for generating high-quality, verifiable, and diverse reasoning data by grounding reasoning steps in executable code. It unifies Code CoTs across math and algorithmic domains, trains a CodeGen model on seed traces, and applies automated execution-based verification before back-translating the verified code into natural language instruction-CoT pairs, yielding a validated dataset of 1.3M samples. Empirical results show strong performance gains across multiple math benchmarks and notable cross-domain generalization, with verification and diversity identified as key drivers of improvement. The work demonstrates a self-sustaining paradigm for trustworthy reasoning in LLMs that reduces human annotation and extends to broader domains and potential RL applications.

Abstract

Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales data generation to a large number of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering the filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
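
The validation step described in the abstract (code execution followed by rule-based filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `run_code_cot` and `filter_candidates`, the 5-second timeout, and the two rules used here (drop exact duplicates, drop programs that print nothing) are assumptions chosen only to make the idea concrete.

```python
import subprocess
import sys


def run_code_cot(code: str, timeout_s: float = 5.0):
    """Run a candidate program in a fresh interpreter.

    Returns the stripped stdout, or None if the program crashes,
    times out, or prints nothing. (Hypothetical helper.)
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    answer = proc.stdout.strip()
    return answer or None


def filter_candidates(candidates):
    """Keep (code, answer) pairs that are unique and execute cleanly.

    The rules here are illustrative assumptions, not the paper's filters.
    """
    kept, seen = [], set()
    for code in candidates:
        key = code.strip()
        if key in seen:  # rule: drop exact duplicates
            continue
        seen.add(key)
        answer = run_code_cot(code)
        if answer is not None:  # rule: must run and produce an answer
            kept.append((code, answer))
    return kept


candidates = [
    "print(sum(range(1, 11)))",  # runs and prints an answer
    "print(1 / 0)",              # raises ZeroDivisionError
    "x = 3",                     # runs but prints nothing
    "print(sum(range(1, 11)))",  # exact duplicate of the first
]
verified = filter_candidates(candidates)
```

Only the first candidate survives in this toy run; in a pipeline of the kind the abstract describes, the surviving programs and their executed answers would then be back-translated into natural language instructions and CoTs.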

Paper Structure

This paper contains 36 sections, 35 equations, 5 figures, 21 tables.

Figures (5)

  • Figure 1: Overview of Caco results. Caco outperforms baseline methods on Olympiad Bench and on average.
  • Figure 2: An overview of the Caco data generation framework, which comprises unifying Code CoT, scaling Code CoT with CodeGen, and instruction reversal with language CoT generation.
  • Figure 3: An example problem with its Code CoT. We demonstrate two augmentations: in problem-level augmentation, the original Code CoT is back-translated into multiple question variants; in pattern-level augmentation, our CodeGen generates novel Code CoTs that generalize beyond the original seed patterns.
  • Figure 4: Left: Problem distribution of our Caco dataset and the original data sources. Right: KMeans clustering result of the problem types.
  • Figure 5: Left: Comparison of solvability and correctness between generated samples with and without verification. Middle: Accuracy comparison between models trained on verified and non-verified data. Right: Performance improvements of the Caco model as data size increases.