Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation

Xunzhu Tang, Iyiola Emmanuel Olatunji, Tiezhu Sun, Jacques Klein, Tegawende F. Bissyande

TL;DR

The paper addresses the brittleness and inefficiency of reasoning-based code generation by combining concise Chain-of-Draft prompts with a reinforcement-learning–driven candidate selector. It introduces Multi-CoD, which treats solution selection as a contextual bandit problem and uses a Value-Advantage Decomposition Network (VADN) to choose among multiple CoD-generated candidates based on interpretable features. The approach achieves competitive or superior performance across MBPP, BigCodeBench, SWE-bench Verified, and Defects4J while materially reducing token costs and enabling faster per-solution generation, improving sustainability and scalability for real-world deployments. These results demonstrate that strategic diversity, interpretable feature-based selection, and cost-aware prompting can close gaps between open- and closed-source models on complex programming tasks.

Abstract

LLMs demonstrate surface-level fluency in code generation but struggle with structured reasoning tasks requiring correctness and semantic alignment. While Chain-of-Thought (CoT) prompting enhances reasoning through intermediate steps, it suffers from verbosity and inefficiency. Chain-of-Draft (CoD) prompting offers more concise reasoning, but the stochastic nature of LLMs produces varying solution quality, making optimal selection challenging. We propose Multi-CoD, a reinforcement learning framework that learns to select the most promising candidate from CoD-generated solutions. Our approach uses strategy-guided prompting to encourage diverse reasoning styles and models solution selection as a contextual bandit problem. The framework optimizes over interpretable features, including code complexity, reasoning structure, and strategic metadata, through a reward function balancing correctness, efficiency, and clarity. Experiments on MBPP, BigCodeBench, SWE-bench Verified, and Defects4J show that Multi-CoD outperforms, and in some cases matches, standard prompting, CoT, and CoD baselines. It also achieves cost and token efficiency from the user's perspective: the multi-candidate design charges only for the selected output, reducing user billing by over 50% while improving LLM response quality. This makes Multi-CoD more sustainable and scalable for real-world deployment. Our code is available at https://anonymous.4open.science/r/MultiCoD.
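
To make the contextual-bandit framing concrete, the following is a minimal sketch, not the paper's implementation. It swaps the VADN selector for a plain linear scorer with epsilon-greedy exploration, and every name in it (`Candidate`, `LinearBanditSelector`, `reward`, `train_selector`), the two-element feature vector, the simulated pass/fail signal, and the 0.8/0.2 reward weights are assumptions chosen purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    strategy: str          # hypothetical strategy label attached to the draft
    features: list[float]  # hypothetical interpretable features: [bias, structure flag]
    passed: bool           # simulated test outcome (stands in for real execution)
    tokens: int            # tokens used by this draft


def reward(c: Candidate, max_tokens: int = 200) -> float:
    """Assumed reward: mostly correctness, plus a small token-efficiency bonus."""
    correctness = 1.0 if c.passed else 0.0
    efficiency = 1.0 - min(c.tokens, max_tokens) / max_tokens
    return 0.8 * correctness + 0.2 * efficiency


class LinearBanditSelector:
    """Epsilon-greedy contextual bandit with a linear reward estimate per candidate."""

    def __init__(self, n_features: int, lr: float = 0.1, epsilon: float = 0.1, seed: int = 0):
        self.w = [0.0] * n_features
        self.lr = lr
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def score(self, features: list[float]) -> float:
        return sum(w * x for w, x in zip(self.w, features))

    def select(self, candidates: list[Candidate], explore: bool = True) -> int:
        # With probability epsilon, pick a random candidate; otherwise the best-scored one.
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))
        scores = [self.score(c.features) for c in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)

    def update(self, features: list[float], observed_reward: float) -> None:
        # Bandit feedback: only the chosen candidate's reward is observed.
        err = observed_reward - self.score(features)
        self.w = [w + self.lr * err * x for w, x in zip(self.w, features)]


def make_task(rng: random.Random, k: int = 3) -> list[Candidate]:
    """Simulate k drafts for one problem; exactly one has the 'good' feature and passes."""
    good = rng.randrange(k)
    return [
        Candidate(f"strategy_{i}", [1.0, 1.0 if i == good else 0.0], i == good, 100)
        for i in range(k)
    ]


def train_selector(rounds: int = 300, seed: int = 0) -> LinearBanditSelector:
    rng = random.Random(seed)
    selector = LinearBanditSelector(n_features=2, seed=seed)
    for _ in range(rounds):
        task = make_task(rng)
        choice = selector.select(task)
        selector.update(task[choice].features, reward(task[choice]))
    return selector


if __name__ == "__main__":
    trained = train_selector()
    print("learned weights:", trained.w)
```

In this toy setup the selector sees only the reward of the draft it picked, which is what distinguishes a bandit from supervised ranking. A real selector would replace the simulated `passed` flag with actual test execution and the two-element feature vector with the code-complexity, reasoning-structure, and strategy features the abstract describes.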

Paper Structure

This paper contains 35 sections, 15 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison of BLEU scores across different solution drafts generated with identical prompting conditions. Draft 2 achieves the highest quality with a score of 0.75, while other drafts show considerable variance in performance.
  • Figure 2: Overview of the Multi-CoD framework. We generate $k$ CoD-guided solutions with diverse strategies and select the best using a learned reinforcement learning selector.
  • Figure 4: Performance comparison of prompting strategies across different foundation models on BigCodeBench (Pass@1). Each axis represents a different prompting strategy, with values increasing outward. Solid lines represent closed-source models, while dashed lines represent open-source models.
  • Figure 5: MBPP benchmark performance by model and prompting strategy. Each panel shows a different foundation model's accuracy progression across four prompting methods: Standard Prompting, Chain-of-Thought, Chain-of-Draft, and Multi-CoD. Blue shades represent increasingly sophisticated prompting strategies.
  • Figure 6: SWE-bench Verified performance by model and prompting strategy. Each panel shows a different foundation model's resolution rate progression across four prompting methods: Standard Prompting, Chain-of-Thought, Chain-of-Draft, and Multi-CoD. The most dramatic improvement is observed in Qwen2.5-Coder-32B.
  • ...and 3 more figures