UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

Jiajun Wu, Jian Yang, Wei Zhang, Lin Jing, Yuqing Ma, Ensheng Shi, Yuchi Ma, Zhoujun Li, Xianglong Liu

TL;DR

UCoder reframes code generation as an unsupervised problem in which the model supervises itself, using IPC (Internal Probing of LLMs for Code generation). A six-stage, execution-driven self-bootstrapping pipeline, combined with execution-driven consensus clustering, identifies high-quality, diverse code solutions without external data and iteratively refines the model. Empirical results show that UCoder at three scales (7B, 14B, 32B) can match or exceed supervised baselines on multiple benchmarks, with notable gains on diverse and challenging tasks and better data efficiency. Analyses indicate that internal model signals guide problem generation, solution clustering, and quality selection, suggesting a scalable path for training code models under resource constraints and for future unsupervised learning in programming tasks.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness relies heavily on supervised training with extensive labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns already present in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation, and uses them to train UCoder (a coder trained with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve performance competitive with supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.

Paper Structure

This paper contains 37 sections, 1 theorem, 8 equations, 9 figures, 3 tables.

Key Result

Theorem 2.4

Let $R=\{r_1,\dots,r_n\}$ be $n$ candidates sampled independently from a model, and let $T$ denote a set of unit tests. Assume that at least $k$ candidates in $R$ are functionally correct with probability at least $1-\delta$, and that any pair of incorrect implementations produces identical outputs on ... then the largest consensus cluster $C_{\max}$ contains only correct implementations with probability ...
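
The idea behind the largest consensus cluster can be illustrated with a small sketch. This page names Definitions 2.1 (Execution Signature) and 2.2 (Consensus Clusters) but does not reproduce them, and the theorem statement above is truncated, so everything below is an assumption for illustration: a signature is taken to be the tuple of outputs a candidate produces on the test inputs, and candidates with identical signatures form one cluster. The function names `execution_signature`, `consensus_clusters`, and `largest_cluster` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Signature = Tuple[Tuple[str, str], ...]


def execution_signature(candidate: Callable[..., Any],
                        test_inputs: Sequence[tuple]) -> Signature:
    """Run one candidate on every test input and record what happened.

    Assumed reading of an "execution signature": the ordered tuple of
    outputs, with exceptions recorded by type name instead of crashing.
    """
    outputs = []
    for args in test_inputs:
        try:
            outputs.append(("ok", repr(candidate(*args))))
        except Exception as exc:  # any runtime failure becomes part of the signature
            outputs.append(("error", type(exc).__name__))
    return tuple(outputs)


def consensus_clusters(candidates: Sequence[Callable[..., Any]],
                       test_inputs: Sequence[tuple]) -> Dict[Signature, List[int]]:
    """Group candidate indices by identical execution signature."""
    clusters: Dict[Signature, List[int]] = defaultdict(list)
    for idx, cand in enumerate(candidates):
        clusters[execution_signature(cand, test_inputs)].append(idx)
    return dict(clusters)


def largest_cluster(candidates: Sequence[Callable[..., Any]],
                    test_inputs: Sequence[tuple]) -> List[int]:
    """Return the indices of the candidates in the biggest cluster."""
    clusters = consensus_clusters(candidates, test_inputs)
    return max(clusters.values(), key=len)


# Toy example: five candidate implementations of absolute value.
candidates = [
    lambda x: x if x >= 0 else -x,  # correct
    lambda x: abs(x),               # correct
    lambda x: max(x, -x),           # correct
    lambda x: x,                    # incorrect: wrong on negatives
    lambda x: -x,                   # incorrect: wrong on positives
]
test_inputs = [(3,), (-4,), (0,)]

selected = largest_cluster(candidates, test_inputs)
print(selected)
```

In this toy run the three correct candidates agree on every input, while the two incorrect ones each fail differently and end up in clusters of size one. Selection by cluster size works here only because the wrong candidates disagree with each other; a sketch like this says nothing about the probability bound in Theorem 2.4.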

Figures (9)

  • Figure 1: Comparison between supervised and unsupervised paradigms for code generation.
  • Figure 2: Overview of the proposed six-stage self-bootstrapping framework for unsupervised code generation.
  • Figure 3: Problem space probing proceeds through three stages: problem generation with function signatures and input-output contracts, difficulty rating assessment and categorization, and solution skeleton generation with implementation structure.
  • Figure 4: Lexical entropy distribution of 16,867 generated problems. Histogram with KDE shows per-problem entropy; CDF (green) and boxplot show cumulative coverage.
  • Figure 5: Complexity versus semantic coverage distribution. Color encodes density; red line shows linear trend ($r = 0.664$).
  • ...and 4 more figures
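
Figure 4 reports a per-problem lexical entropy, but this page does not give the formula. A minimal sketch under one common reading, Shannon entropy over the whitespace-token frequencies of a problem statement, is shown below. Both the tokenization and the function name `lexical_entropy` are assumptions, not the paper's definition.

```python
import math
from collections import Counter


def lexical_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the token distribution of `text`.

    Assumption: tokens are lowercase whitespace-separated words; the
    paper may use a different tokenizer or normalization.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A statement that repeats one word has zero entropy; four distinct
# words, each appearing once, give log2(4) = 2 bits.
print(lexical_entropy("sort sort sort sort"))
print(lexical_entropy("return the sorted list"))
```

Under this reading, higher entropy means a problem statement draws on a more varied vocabulary, which is one way to quantify the lexical diversity of generated problems.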

Theorems & Definitions (5)

  • Definition 2.1: Execution Signature
  • Definition 2.2: Consensus Clusters
  • Definition 2.3: Quality Metrics
  • Theorem 2.4: Consensus Convergence
  • Definition 2.5: Iterative Update