Understanding Chain-of-Thought Effectiveness in Code Generation: An Empirical and Information-Theoretic Analysis

Naizhu Jin, Zhong Li, Guang Yang, Tian Zhang, Qingkai Zeng

TL;DR

This work systematically compares multiple Chain-of-Thought paradigms for neural code generation through an information-theoretic lens, revealing that externally guided and structured CoT methods deliver meaningful accuracy gains with far lower token costs than deep reflective reasoning. The authors demonstrate capacity-dependent effects and cross-language generalization, showing that statically typed languages benefit more from structured reasoning while dynamically typed languages often respond to reflective approaches. A key contribution is the conditional mutual information framework, $I(Y;C|X)$, which explains when and why reasoning chains reduce uncertainty in code generation. Practically, the findings guide practitioners to choose CoT strategies based on model capacity and language characteristics, enabling more efficient and reliable code synthesis at scale.

Abstract

Large language models (LLMs) achieve strong performance on code generation, but the mechanisms by which Chain-of-Thought (CoT) prompting helps remain unclear. We present a systematic empirical and information-theoretic study of CoT effectiveness in neural code generation, evaluating five paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT, Reasoning-CoT) across six Python benchmarks, a multilingual benchmark with 12 programming languages, and six models from 7B to 480B parameters, using conditional mutual information $I(Y;C|X)$ as a conceptual lens. Our results show that externally guided CoT consistently outperforms direct generation, with structured methods improving Pass@1 by 5–12% on average while using substantially fewer tokens than reflective reasoning, and that CoT benefits depend on language type systems and model capacity. We further find that reasoning quality is critical: high-quality structured CoT from strong generators yields significantly higher accuracy than lightweight alternatives with the same template, whereas naive Zero-Shot CoT can even degrade performance. These findings provide practical guidance for choosing CoT strategies based on model capacity, language characteristics, and task complexity.
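For reference, the standard definition of conditional mutual information is written out below. The abstract does not define the variables, so reading $X$ as the problem description, $C$ as the reasoning chain, and $Y$ as the generated code is an assumption made here for illustration; the paper's own formulation may differ in detail.

```latex
% Standard identity; the variable roles (X = prompt, C = chain, Y = code) are assumed.
I(Y; C \mid X) \;=\; H(Y \mid X) \;-\; H(Y \mid X, C) \;\ge\; 0
```

Under that reading, a reasoning chain is useful only to the extent that it removes uncertainty about the correct code beyond what the prompt already provides. The quantity is zero exactly when $Y$ and $C$ are conditionally independent given $X$, that is, when the chain carries no additional information about the code.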

Paper Structure

This paper contains 48 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Token Cost Comparison.
  • Figure 2: Logic Divergence. The 7B model truncates reasoning before semantic closure, omitting the final validation check. The 32B model completes the logic (if second_smallest == float('inf'): return None), demonstrating capacity-dependent reasoning completeness.
  • Figure 3: Type/Boundary Error. GPT-3.5 performs unsafe indexing (lst[index] = lst[index-1]), exceeding array bounds. GPT-5 expands the list before insertion (lst.append(None)), avoiding runtime violations through superior structural reasoning.
  • Figure 4: Reasoning-Execution Misalignment. The 32B model introduces advanced optimization (math.gcd()) in reasoning but fails to import the module, causing NameError. The 7B model maintains simpler reasoning-execution alignment with explicit from math import gcd.
  • Figure 5: Detailed results for RQ4.
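
The failure modes named in Figures 2 to 4 can be illustrated with a minimal Python sketch. The figures themselves are not reproduced here, so the task definitions and the function names `second_smallest`, `insert_at`, and `reduce_fraction` are hypothetical; only the quoted fragments (the `float('inf')` check, `lst.append(None)`, and the `gcd` import) come from the captions.

```python
from math import gcd  # explicit import; omitting it is the NameError case in Figure 4


def second_smallest(values):
    """Return the second-smallest distinct value, or None if there is none."""
    smallest = second = float("inf")
    for v in values:
        if v < smallest:
            smallest, second = v, smallest
        elif smallest < v < second:
            second = v
    # Final validation check that Figure 2 reports the 7B model omitting.
    if second == float("inf"):
        return None
    return second


def insert_at(lst, index, item):
    """Insert item at index (assumes 0 <= index <= len(lst)) by shifting right."""
    lst.append(None)  # grow the list first, as in the Figure 3 caption
    for i in range(len(lst) - 1, index, -1):
        lst[i] = lst[i - 1]  # safe now: i never exceeds the last valid index
    lst[index] = item
    return lst


def reduce_fraction(numerator, denominator):
    """Reduce a fraction to lowest terms; relies on the gcd import above."""
    divisor = gcd(numerator, denominator)
    return numerator // divisor, denominator // divisor
```

Each function isolates one caption: dropping the final `if` in `second_smallest` returns `inf` instead of `None` for inputs with no second distinct value, removing the `append` in `insert_at` makes the shift index past the end of the list, and deleting the import line turns `reduce_fraction` into a runtime `NameError`.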