
Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

Abstract

One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer from two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generators while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.

Paper Structure

This paper contains 61 sections, 2 theorems, 36 equations, 26 figures, 16 tables, 1 algorithm.

Key Result

Lemma G.3

Under Assumption assump:smooth, the soft-embedding surrogate satisfies: Thus the soft-embedding objective $f(\mu)$ matches the true discrete objective up to a second-order term in $\|\Sigma\|$. Furthermore, if the predicted distribution is concentrated on a mode $j^\ast$ with and if all embeddings satisfy $\|e_j\|_2\le B$, then $\|\Sigma\|=\mathcal{O}(\varepsilon)$, and thus the soft-embedding b

Figures (26)

  • Figure 1: Qualitative results produced by our one-step generators distilled from MaskBit [weber2024maskbit] (top) and MaskGen-L [kim2025democratizing] (bottom).
  • Figure 2: Soft-Di[M]O pipeline. Given a sampled initialization $x_{\text{init}}$, the one-step generator (student $\theta$) outputs logits $z_{\theta}$; discrete image tokens $x_{\theta}=\{x_{\theta}^i\}_{i=1}^L$ are sampled from these logits and used to form the Di[M]O distillation loss (red path). In parallel, the logits are converted into soft embeddings (purple path); these differentiable soft embeddings permit various forms of supervision during post-training, such as a GAN loss and a reward loss, allowing the student to compete with or even surpass the teacher's performance. The right panel illustrates the soft-embedding construction for each position $i$: a softmax maps the logits $z_\theta^i$ to probabilities $p^i$, and the weighted sum $\sum_j p^i_j E_j$ yields $\tilde{e}_\theta^i$. For simplicity, we write $p_j^i=p_\theta(x_0^i=j \mid x_{\text{init}})$. More details on the discriminator are shown in Figure 5.
  • Figure 3: Qualitative comparison of one- and few-step distillation methods using the same text prompts. Di[M]O and Soft-Di[M]O are both distilled from the MaskGen-L teacher. The number of inference steps is indicated by 's'.
  • Figure 4: Ablation studies on ImageNet-256 using FID as the evaluation metric with MaskBit teacher.
  • Figure 5: Soft-Di[M]O zoom-in details: the multi-scale discriminator in the pipeline of Figure 2.
  • ...and 21 more figures
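
The soft-embedding construction described in the Figure 2 caption (softmax over the logits, then a probability-weighted sum of codebook rows) can be sketched in a few lines. This is a minimal illustration for a single token position, not the authors' implementation: the function names `softmax` and `soft_embedding` and the toy codebook `E` are hypothetical, and plain Python lists are used where a real pipeline would presumably use an autodiff framework so that gradients reach the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits z^i."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, codebook):
    """Expected embedding sum_j p_j * E_j for one token position.

    `codebook` is a list of K embedding vectors E_j (hypothetical toy data here).
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * row[d] for p, row in zip(probs, codebook)) for d in range(dim)]


# Toy codebook with K = 3 entries of dimension 2 (illustrative values only).
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Uniform logits: the soft embedding is the plain average of the codebook rows.
uniform = soft_embedding([0.0, 0.0, 0.0], E)

# Sharply peaked logits: the soft embedding is nearly the hard embedding E[0].
peaked = soft_embedding([50.0, 0.0, 0.0], E)
```

When the distribution is sharply peaked, the soft embedding nearly coincides with the embedding of the most likely token; when it is spread out, the result is an interpolation between codebook entries. Every step is a smooth function of the logits, which is what lets losses defined on embeddings pass gradients back to the generator.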

Theorems & Definitions (2)

  • Lemma G.3: Second-order bias of Soft Embedding
  • Lemma G.4: First-order bias of Gumbel-Softmax ST
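
The concentration claim in Lemma G.3 (bounded embeddings and a peaked prediction give $\|\Sigma\|=\mathcal{O}(\varepsilon)$) can be checked numerically on toy data. The sketch below rests on assumptions that the truncated statement above does not spell out: $\Sigma$ is taken to be the covariance of the token embedding under the predicted distribution, and "concentrated on a mode $j^\ast$" is taken to mean $p_{j^\ast} = 1-\varepsilon$. The constant $4B^2$ comes from an elementary argument ($\|\Sigma\|_2 \le \operatorname{tr}\Sigma \le \mathbb{E}\|e - e_{j^\ast}\|^2 \le 4B^2\varepsilon$) and is not quoted from the paper. All names and the random codebook are hypothetical.

```python
import math
import random


def covariance_trace(probs, codebook):
    """tr(Sigma) = sum_j p_j * ||E_j - mu||^2, an upper bound on the spectral norm of Sigma."""
    dim = len(codebook[0])
    mu = [sum(p * row[d] for p, row in zip(probs, codebook)) for d in range(dim)]
    return sum(
        p * sum((row[d] - mu[d]) ** 2 for d in range(dim))
        for p, row in zip(probs, codebook)
    )


random.seed(0)
B = 1.0        # assumed bound on embedding norms, ||e_j||_2 <= B
K, DIM = 8, 4  # toy codebook size and embedding dimension

# Random codebook whose rows are rescaled to have norm exactly B.
codebook = []
for _ in range(K):
    v = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    norm = math.sqrt(sum(x * x for x in v))
    codebook.append([B * x / norm for x in v])


def concentrated_probs(eps, mode=0):
    """Mass 1 - eps on the mode, eps spread evenly over the other entries (assumed form)."""
    rest = eps / (K - 1)
    return [1.0 - eps if j == mode else rest for j in range(K)]


# Each entry is (eps, tr(Sigma), 4 * B^2 * eps).
results = []
for eps in (0.1, 0.01, 0.001):
    tr = covariance_trace(concentrated_probs(eps), codebook)
    results.append((eps, tr, 4 * B ** 2 * eps))
```

In this toy setting the trace of $\Sigma$ shrinks in proportion to $\varepsilon$ and stays below $4B^2\varepsilon$, consistent with the lemma's message that the gap between the soft-embedding objective and the discrete objective vanishes as the generator's prediction sharpens.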