
Unlocking Prompt Infilling Capability for Diffusion Language Models

Yoshinari Fujinuma, Keisuke Sakaguchi

Abstract

Masked diffusion language models (dLMs) generate text through bidirectional denoising, yet this capability remains locked for infilling prompts. This limitation is an artifact of the current supervised finetuning (SFT) convention of applying response-only masking. To unlock this capability, we extend full-sequence masking during SFT, where both prompts and responses are masked jointly. Once unlocked, the model infills masked portions of a prompt template conditioned on few-shot examples. We show that such model-infilled prompts match or surpass manually designed templates, transfer effectively across models, and are complementary to existing prompt optimization methods. Our results suggest that training practices, not architectural limitations, are the primary bottleneck preventing masked diffusion language models from infilling effective prompts.
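The shift from response-only to full-sequence masking can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: `MASK_ID`, `mask_prob`, and the use of `-100` as an ignored loss target are assumptions for the sketch.

```python
import random

MASK_ID = 0  # hypothetical [MASK] token id


def apply_masking(prompt_ids, response_ids, full_sequence, mask_prob=0.5):
    """Mask tokens for one SFT example.

    Response-only masking (the usual dLM SFT convention) leaves the prompt
    intact, so the model never learns to denoise prompt positions.
    Full-sequence masking also lets prompt tokens be masked, which is what
    later enables prompt infilling.
    """
    seq = prompt_ids + response_ids
    maskable_from = 0 if full_sequence else len(prompt_ids)
    masked, targets = [], []
    for i, tok in enumerate(seq):
        if i >= maskable_from and random.random() < mask_prob:
            masked.append(MASK_ID)
            targets.append(tok)   # loss is computed on masked positions
        else:
            masked.append(tok)
            targets.append(-100)  # position ignored by the loss
    return masked, targets
```

With `full_sequence=False` the prompt tokens are never replaced, matching the response-only convention; with `full_sequence=True` every position is a candidate for masking.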

Paper Structure

This paper contains 48 sections, 7 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Overview of the prompt infilling procedure and the change in training. (1) Gather few-shot examples with prompt templates and reference responses. (2) A diffusion LM (dLM) infills masked tokens in the prompt template, conditioned on the reference responses. (3) Validate infilled prompts by generating responses across all few-shot examples using either a dLM or an LLM. (4) The best infilled prompt is used for final inference on all inputs.
  • Figure 2: Example prompt with infilled tokens using the public LLaDA checkpoint (LLaDA-8B-Instruct) on GSM8K. All masked prefix tokens are filled with EOS tokens, confirming the training-inference gap for prompt infilling.
  • Figure 3: Example of a score rubric with infilled tokens by the model (right) and original score rubric (left). The model replaces descriptions with non-uniform score values (e.g., 1.2, 1.8, 4.8) inferred from few-shot examples, encouraging float score outputs which results in better human correlation.
  • Figure 4: Prompt transfer experiments. (a) GSM8K: infilled prompts achieve higher accuracy with fewer tokens than ICL. (b) SummEval: infilled prompts transfer across LLaDA variants with different training configurations.
  • Figure 5: Full prompt transfer evaluation on SummEval using LLaDA models trained with different configurations (mean $\pm$ std), showing Pearson, Spearman, and Kendall correlations. Baseline results are from Table \ref{tab:llm_judge}, which reports Judge Infilled prompt results across training stages. The transferred prompt improves on the Public and RO models, showing that infilled prompts transfer across LLaDA variants.
  • ...and 1 more figure
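The four-step procedure in Figure 1 can be sketched as a small selection loop. Everything here is a hedged sketch: `infill_fn`, `generate_fn`, and `score_fn` are hypothetical stand-ins for the dLM infilling call, the response generator (dLM or LLM), and the validation metric, none of which are specified at this level in the paper.

```python
def select_best_prompt(template_with_masks, fewshot_examples,
                       infill_fn, generate_fn, score_fn, n_candidates=4):
    """Sketch of Figure 1's pipeline (all helper functions are hypothetical).

    (1) fewshot_examples: (input, reference_response) pairs.
    (2) infill_fn fills the masked slots in the prompt template,
        conditioned on the few-shot examples.
    (3) Each candidate prompt is validated by generating a response for
        every few-shot input and scoring it against the reference.
    (4) The best-scoring infilled prompt is returned for final inference.
    """
    candidates = [infill_fn(template_with_masks, fewshot_examples)
                  for _ in range(n_candidates)]

    def validation_score(prompt):
        return sum(score_fn(generate_fn(prompt, x), y)
                   for x, y in fewshot_examples)

    return max(candidates, key=validation_score)
```

The validation step reuses the same few-shot examples that conditioned the infilling, so no extra labeled data is needed beyond the initial few-shot set.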