Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

Huan Zhang, Wei Cheng, Wei Hu

Abstract

Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test suites. In real-world scenarios, however, reference solutions and test oracles are far harder to obtain than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher or a test oracle? To answer this, we propose ConSelf, a self-improving approach built on two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling the construction of a curriculum from the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments across multiple benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization for improving code generation without external supervision.
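The core intuition behind code semantic entropy can be sketched as follows: sample several candidate programs for a problem, cluster them by their observable behavior on the available test inputs (no oracle needed), and compute the Shannon entropy over the resulting behavior clusters. The sketch below is a minimal illustration of this idea, not the paper's exact formulation; the function names and the treatment of crashing programs are assumptions.

```python
import math
from collections import Counter

def behavior_signature(program, test_inputs):
    """Run a candidate program on shared test inputs and record its outputs.

    Programs that produce identical outputs on every input are treated as
    behaviorally (semantically) equivalent; crashes form their own class.
    """
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(program(x))
        except Exception:
            outputs.append("<error>")
    return tuple(outputs)

def code_semantic_entropy(programs, test_inputs):
    """Shannon entropy over the behavior clusters of sampled programs.

    High entropy: the model's samples disagree (an uncertain, hard problem).
    Zero entropy: all samples behave identically (a confident, easy problem).
    """
    signatures = [behavior_signature(p, test_inputs) for p in programs]
    counts = Counter(signatures)
    n = len(signatures)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, if three of four sampled programs fall into one behavior cluster and one into another, the entropy is about 0.81 bits; problems whose entropy is neither zero (already solved consistently) nor maximal (all samples disagree) are plausible candidates for the "most learnable" curriculum the abstract describes.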

Paper Structure

This paper contains 32 sections, 5 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: The dilemma of learning from "noisy data" generated for an intractable problem. When all self-generated solutions are flawed, any learning method (SFT/DPO) becomes futile, highlighting the need to identify and filter out such problems.
  • Figure 2: Overview of the ConSelf approach. The model generates code samples for each problem by observation-guided sampling, estimates code semantic entropy to filter problems, and fine-tunes itself on consensus-driven preference pairs.
  • Figure 3: The prompt template for observation generation and example observations generated for the count_primes(n) problem. The prompt guides the model to produce diverse insights, serving as varied conditions for code generation.
  • Figure 4: The prompt template for code generation. The model generates $n_{\text{code}}$ candidate programs conditioned on each observation to enhance solution diversity.
  • Figure 5: Comparison of the number of training examples generated by different methods for each model.
  • ...and 2 more figures
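The Con-DPO idea from the abstract, weighting each preference pair by its behavioral consensus, can be illustrated on a single pair. The sketch below scales the standard DPO loss by a consensus weight; the linear scaling and the scalar log-probability interface are assumptions for illustration, not the paper's exact loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def con_dpo_loss(policy_logp_chosen, policy_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected,
                 consensus_weight, beta=0.1):
    """Consensus-weighted DPO loss for one preference pair (scalar sketch).

    consensus_weight in (0, 1] could be, e.g., the fraction of sampled
    programs whose behavior agrees with the chosen program; pairs with weak
    behavioral consensus (likely noisy labels) then contribute less to the
    gradient, mitigating noisy self-generated supervision.
    """
    # Standard DPO margin: difference of policy-vs-reference log-ratios
    # between the chosen and rejected programs.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -consensus_weight * math.log(sigmoid(margin))
```

With a zero margin the unweighted loss is log 2; halving the consensus weight halves the loss, so low-consensus (noisy) pairs exert proportionally less influence on the update.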