AnCoder: Anchored Code Generation via Discrete Diffusion Models

Anton Xue, Litu Rout, Constantine Caramanis, Sanjay Shakkottai

TL;DR

This paper addresses the fragility of diffusion-based code generation by introducing AnchorTree, a hierarchical soft anchoring framework that uses the abstract syntax tree (AST) to prioritize learning and denoising of syntactically and semantically important tokens. AnCoder couples an anchor network with a denoiser in a two-stage architecture and is trained with an Anchored Negative ELBO objective, so that it learns to respect code structure. This yields higher syntactic validity and functional correctness on HumanEval and MBPP, outperforming diffusion baselines of similar scale. The work provides a parameter-efficient path to improved code generation and highlights the value of structural priors in diffusion-based modeling of code, with potential applicability to broader structured tasks. Overall, AnchorTree demonstrates that leveraging hierarchical program representations can narrow the gap between diffusion models and autoregressive baselines in producing executable, high-quality code.

Abstract

Diffusion language models offer a compelling alternative to autoregressive code generation, enabling global planning and iterative refinement of complex program logic. However, existing approaches fail to respect the rigid structure of programming languages and, as a result, often produce broken programs that fail to execute. To address this, we introduce AnchorTree, a framework that explicitly anchors the diffusion process using structured, hierarchical priors native to code. Specifically, AnchorTree uses the abstract syntax tree to prioritize resolving syntactically and semantically salient tokens, such as keywords (e.g., if, while) and identifiers (e.g., variable names), thereby establishing a structural scaffold that guides the remaining generation. We validate this framework via AnCoder, a family of models showing that structurally anchored diffusion offers a parameter-efficient path to high-quality code generation.
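As a rough illustration of what "keywords and identifiers as anchors" could mean for Python source, the sketch below labels tokens with the standard-library tokenize and keyword modules. It is not the paper's labeling pipeline: the function name label_anchors, the three label strings, and the choice to skip layout tokens are assumptions made for this example.

```python
import io
import keyword
import tokenize

# Layout-only token types that carry no anchor information (an assumption of this sketch).
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}

def label_anchors(source: str):
    """Return (token_string, kind) pairs; kind is 'keyword', 'identifier', or 'other'."""
    labels = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME:
            # tokenize reports keywords and identifiers alike as NAME tokens,
            # so keyword.iskeyword is used to tell them apart.
            kind = "keyword" if keyword.iskeyword(tok.string) else "identifier"
        else:
            kind = "other"
        labels.append((tok.string, kind))
    return labels

source = "if x > 0:\n    total = total + x\n"
labels = label_anchors(source)
# Anchor candidates: every keyword or identifier token, in source order.
anchors = [text for text, kind in labels if kind != "other"]
```

Here anchors collects the tokens that an anchoring phase would aim to resolve first; operators, literals, and punctuation are left for the later resolution phase.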

Paper Structure (26 sections, 11 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: AnCoder: Anchored Code Generation. Generation begins with conditioning, where the input prompt is mask-padded. During the anchoring phase, syntactically and semantically salient tokens such as keywords and identifiers are unmasked according to their syntactic hierarchy. Given these anchors, the model can then better unmask the remaining tokens in the final resolution phase. Denoising in this order yields code with higher functional correctness than diffusion language model baselines.
  • Figure 2: ADLM forward pass. Given a partially masked $z_t$ (top), the anchor network $y_{\theta_A}$ focuses on unmasking the anchor token numbers (middle), which helps the denoiser network $x_{\theta_D}$ unmask the remaining tokens, such as num (bottom). Anchor tokens are labeled prior to training (see the experiments section).
  • Figure 3: The syntactic hierarchy of code. A sequence of source code tokens (left) is parsed into an AST (right). Here, a single assignment statement forms a subtree nested within the broader program context (denoted by ...). AST nodes are labeled with the type, data, and relative character spans (e.g., [5:10]) that map high-level syntactic constructs onto their sequential token positions.
  • Figure 4: AST-based ordering and the AnchorTree weight $\mu(l) = \omega(l) \cdot \eta(l)$. (Left) Example of an ascending chain $l_0 \preceq l_1 \preceq \cdots \preceq l_4$ starting at mid and ordered by the AST hierarchy. (Middle) We take keywords (blue) and identifiers (orange) to be the anchors, for which we set $\omega(l) = 1$. (Right) Each position is weighted by its AST depth, with $\eta(l') \geq \eta(l)$ whenever $l' \succeq l$ in the AST-based partial ordering.
  • Figure 5: Unmasking by AST ancestry improves performance. Cumulatively unmasking a target position $l_0$'s AST ancestors $l_1, l_2, \ldots$ improves performance more than simply unmasking other tokens at random. Moreover, performance improves the fastest when we reveal AST ancestors closest to the target first (in-out). This trend holds at both moderate-noise ($t = 0.85$) and high-noise ($t = 0.95$) settings.
  • ...and 1 more figure
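
The weight $\mu(l) = \omega(l) \cdot \eta(l)$ from Figure 4 can be sketched at the level of AST nodes with Python's ast module. This is a minimal sketch under stated assumptions, not the paper's definition: the set ANCHOR_TYPES, the value given to non-anchors (non_anchor_weight, 0 by default), and the specific choice $\eta = 1 / (1 + \text{depth})$ are all hypothetical. The only properties taken from the caption are that anchors get $\omega = 1$ and that $\eta$ does not decrease when moving from a node to its AST ancestors.

```python
import ast

# Hypothetical anchor set: keyword-bearing statements plus identifiers.
ANCHOR_TYPES = (ast.If, ast.While, ast.For, ast.Return, ast.FunctionDef, ast.Name)

def eta(depth: int) -> float:
    """Depth weight (assumed form): ancestors, having smaller depth, never get less weight."""
    return 1.0 / (1.0 + depth)

def omega(node: ast.AST, non_anchor_weight: float = 0.0) -> float:
    """Anchor indicator: 1 for anchors, an assumed constant for everything else."""
    return 1.0 if isinstance(node, ANCHOR_TYPES) else non_anchor_weight

def anchor_tree_weights(source: str):
    """Return (node_type, depth, mu) for every AST node in pre-order."""
    tree = ast.parse(source)
    rows = []

    def visit(node: ast.AST, depth: int) -> None:
        rows.append((type(node).__name__, depth, omega(node) * eta(depth)))
        for child in ast.iter_child_nodes(node):
            visit(child, depth + 1)

    visit(tree, 0)
    return rows

weights = anchor_tree_weights("mid = (lo + hi) // 2")
# Keep only the nodes that received a nonzero weight under these assumptions.
weighted = [(name, depth, mu) for name, depth, mu in weights if mu > 0]
```

With these choices, the identifier mid sits higher in the tree than lo and hi and therefore receives the larger weight. Mapping node weights back onto token positions would use the character spans shown in Figure 3, which this sketch leaves out.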