Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation
Xunzhu Tang, Iyiola Emmanuel Olatunji, Tiezhu Sun, Jacques Klein, Tegawende F. Bissyande
TL;DR
The paper addresses the brittleness and inefficiency of reasoning-based code generation by combining concise Chain-of-Draft prompts with a reinforcement-learning–driven candidate selector. It introduces Multi-CoD, which treats solution selection as a contextual bandit problem and uses a Value-Advantage Decomposition Network (VADN) to choose among multiple CoD-generated candidates based on interpretable features. The approach achieves competitive or superior performance across MBPP, BigCodeBench, SWE-bench Verified, and Defects4J while materially reducing token costs and enabling faster per-solution generation, improving sustainability and scalability for real-world deployments. These results demonstrate that strategic diversity, interpretable feature-based selection, and cost-aware prompting can close gaps between open- and closed-source models on complex programming tasks.
Abstract
LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks requiring correctness and semantic alignment. While Chain-of-Thought (CoT) prompting enhances reasoning through intermediate steps, it suffers from verbosity and inefficiency. Chain-of-Draft (CoD) prompting offers more concise reasoning, but the stochastic nature of LLMs produces varying solution quality, making optimal selection challenging. We propose Multi-CoD, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions. Our approach uses strategy-guided prompting to encourage diverse reasoning styles and models solution selection as a contextual bandit problem. The framework optimizes interpretable features including code complexity, reasoning structure, and strategic metadata through a reward function balancing correctness, efficiency, and clarity. Experiments on MBPP, BigCodeBench, SWE-bench Verified, and Defects4J show that Multi-CoD outperforms or, in some cases, performs on par with standard prompting, CoT, and CoD baselines. It also achieves cost and token efficiency from the user's perspective through a multi-candidate design that charges only for the selected output, reducing user billing by over 50% and improving LLM response quality, which makes Multi-CoD more sustainable and scalable for real-world deployment. Our code is available at https://anonymous.4open.science/r/MultiCoD.
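To make the contextual-bandit framing in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate is summarized by a small numeric feature vector, uses a plain linear scorer with epsilon-greedy exploration in place of the VADN selector, and uses a hypothetical weighted reward over correctness, efficiency, and clarity. All names, weights, and the token budget are illustrative assumptions.

```python
import random


def reward(passed, tokens, clarity,
           w_correct=1.0, w_eff=0.2, w_clear=0.1, token_budget=500):
    """Illustrative reward mixing correctness, efficiency, and clarity.

    The weights and token_budget are assumptions, not values from the paper.
    """
    # Efficiency falls linearly with token use and is clipped at zero.
    efficiency = max(0.0, 1.0 - tokens / token_budget)
    return w_correct * float(passed) + w_eff * efficiency + w_clear * clarity


class LinearBanditSelector:
    """Epsilon-greedy contextual bandit with a linear scorer.

    A simplified stand-in for the learned selector described in the abstract.
    """

    def __init__(self, n_features, epsilon=0.1, lr=0.05, seed=0):
        self.weights = [0.0] * n_features
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def score(self, features):
        # Dot product between learned weights and candidate features.
        return sum(w * x for w, x in zip(self.weights, features))

    def select(self, candidates):
        """Return the index of the chosen candidate feature vector."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))  # explore
        scores = [self.score(f) for f in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)  # exploit

    def update(self, features, observed_reward):
        """One gradient step on the squared error of the predicted reward."""
        error = observed_reward - self.score(features)
        self.weights = [w + self.lr * error * x
                        for w, x in zip(self.weights, features)]
```

In use, one would build a feature vector per CoD candidate, call `select`, evaluate only the chosen candidate (for example by running unit tests), and pass the resulting reward to `update`. Only the selected arm is observed, which is what distinguishes the bandit setting from full supervised ranking.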
