Understanding Chain-of-Thought Effectiveness in Code Generation: An Empirical and Information-Theoretic Analysis
Naizhu Jin, Zhong Li, Guang Yang, Tian Zhang, Qingkai Zeng
TL;DR
This work systematically compares multiple Chain-of-Thought paradigms for neural code generation through an information-theoretic lens, revealing that externally guided and structured CoT methods deliver meaningful accuracy gains at far lower token cost than deep reflective reasoning. The authors demonstrate capacity-dependent effects and cross-language generalization, showing that statically typed languages benefit more from structured reasoning while dynamically typed languages often respond better to reflective approaches. A key contribution is the $I(Y;C|X)$ framework, which explains when and why reasoning chains reduce uncertainty in code generation. Practically, the findings help practitioners choose CoT strategies based on model capacity and language characteristics, enabling more efficient and reliable code synthesis at scale.
Abstract
Large language models (LLMs) achieve strong performance on code generation, but the mechanisms by which Chain-of-Thought (CoT) prompting helps remain unclear. We present a systematic empirical and information-theoretic study of CoT effectiveness in neural code generation, using conditional mutual information $I(Y;C|X)$ as a conceptual lens. We evaluate five paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT, Reasoning-CoT) across six Python benchmarks, a multilingual benchmark covering 12 programming languages, and six models ranging from 7B to 480B parameters. Our results show that externally guided CoT consistently outperforms direct generation: structured methods improve Pass@1 by 5--12\% on average while using substantially fewer tokens than reflective reasoning. CoT benefits also depend on language type systems and model capacity. We further find that reasoning \emph{quality} is critical: high-quality structured CoT from strong generators yields significantly higher accuracy than lightweight alternatives with the same template, whereas naive Zero-Shot CoT can even degrade performance. These findings provide practical guidance for choosing CoT strategies based on model capacity, language characteristics, and task complexity.
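
For reference, the conditional mutual information named above has a standard entropy decomposition. The reading of the symbols is an assumption here, since this section does not define them: $X$ is taken to be the task specification, $C$ the reasoning chain, and $Y$ the generated code.

\[
I(Y;C \mid X) \;=\; H(Y \mid X) \;-\; H(Y \mid X, C) \;\ge\; 0
\]

Under that reading, a reasoning chain is informative only to the extent that it reduces the uncertainty about the code that remains once the task is known; $I(Y;C \mid X) = 0$ exactly when $Y$ and $C$ are conditionally independent given $X$.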
