CoDA: Coding LM via Diffusion Adaptation

Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao

TL;DR

CoDA advances diffusion-based code generation by delivering a compact 1.7B diffusion coder built on the Qwen3 backbone and trained via a fully open TPU pipeline, enabling bidirectional decoding and infilling at interactive latency. The approach combines large-scale general pre-training (~180B tokens) with code-focused mid-training (~20B tokens) and instruction-tuned post-training, bridged by a progressive masking schedule to align training and inference. Empirical results on HumanEval, MBPP, and EvalPlus show CoDA-1.7B-Instruct matching or surpassing diffusion models up to 7B parameters, while remaining competitive with autoregressive baselines at similar sizes. The work releases model checkpoints, evaluation harnesses, and end-to-end TPU training tooling to accelerate research into lightweight diffusion-based coding assistants and future hybrid decoding strategies.

Abstract

Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.

Paper Structure

This paper contains 22 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The masking distribution during different stages. Green tiles represent text tokens, blue tiles are mask tokens, and tiles with red border lines indicate tokens that are conditioned not to be masked. During pre-training or mid-training, masking is random. In the post-training stage, a structured masking strategy is applied. For inference, the model is conditioned on a prefix to perform infilling.
  • Figure 2: A visualization of the masking schedule. S1: a randomly chosen prefix is kept as conditioning and made unmaskable; S2: a randomly chosen suffix is replaced with the pad token and made unmaskable; S3: block masking with block size $k=2$.
  • Figure 3: Relationship between diffusion steps, inference time, and CoDA-1.7B-Instruct performance. Inference time is the total time taken to run inference over the HumanEval dataset. Model performance is measured by pass@1 on the same dataset with a 768-token budget.
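
The three masking strategies named in the Figure 2 caption (S1, S2, S3) can be illustrated with a small sketch. This is not the authors' implementation: the token strings `<mask>` and `<pad>`, the function names, the per-token masking probability `ratio`, and the choice to align blocks to multiples of `k` are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

MASK = "<mask>"  # hypothetical mask token string
PAD = "<pad>"    # hypothetical pad token string


def s1_prefix(tokens: List[str], ratio: float, rng: random.Random) -> Tuple[List[str], int]:
    """S1: a random prefix stays visible; later tokens are masked with probability `ratio`."""
    prefix_len = rng.randint(0, len(tokens))
    out = [
        tok if i < prefix_len or rng.random() >= ratio else MASK
        for i, tok in enumerate(tokens)
    ]
    return out, prefix_len


def s2_suffix(tokens: List[str], ratio: float, rng: random.Random) -> Tuple[List[str], int]:
    """S2: a random suffix becomes padding and is never masked;
    tokens before the cut are masked with probability `ratio`."""
    cut = rng.randint(0, len(tokens))
    head = [MASK if rng.random() < ratio else tok for tok in tokens[:cut]]
    return head + [PAD] * (len(tokens) - cut), cut


def s3_block(tokens: List[str], ratio: float, rng: random.Random, k: int = 2) -> List[str]:
    """S3: mask whole blocks of k tokens, each block with probability `ratio`.
    Blocks are aligned to multiples of k (an assumption of this sketch)."""
    out = list(tokens)
    for start in range(0, len(tokens), k):
        if rng.random() < ratio:
            end = min(start + k, len(tokens))
            out[start:end] = [MASK] * (end - start)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    toks = ["def", "add", "(", "a", ",", "b", ")", ":"]
    print(s1_prefix(toks, 0.5, rng))
    print(s2_suffix(toks, 0.5, rng))
    print(s3_block(toks, 0.5, rng, k=2))
```

In each function the protected region (the prefix in S1, the padded suffix in S2) is never replaced by the mask token, which mirrors the "unmaskable" tiles described in the captions. How the actual pipeline samples the prefix length, cut point, and masking ratio is not stated in this section.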