DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

Xiaoxiao He, Quan Dao, Ligong Han, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, Bo Liu, Kang Li, Hongdong Li, Junzhou Huang, Faez Ahmed, Akash Srivastava, Dimitris Metaxas

TL;DR

DICE introduces a novel inversion framework for discrete diffusion models, enabling precise reconstruction and controllable editing by recording reverse-path residuals in multinomial diffusion and masked generative models. It avoids dependence on predefined masks or attention manipulation and demonstrates high data fidelity and versatile editing in both image and text domains, including transforming RoBERTa into a generative editor. The method leverages latent residuals and tunable noise-injection strategies ($\tau$, $\lambda_1$, $\lambda_2$) to balance reconstruction quality and edit strength, validated across Paella, VQ-Diffusion, RoBERTa, and LLaDA with comprehensive quantitative and qualitative results. This work broadens the applicability of discrete diffusion models to fine-grained content manipulation in discrete spaces, with practical implications for controlled editing in vision and language tasks.

Abstract

Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.

Paper Structure

This paper contains 23 sections, 24 equations, 17 figures, 9 tables, 2 algorithms.

Figures (17)

  • Figure 1: Illustration of a limitation of masked inpainting. Inpainting with masked generation inadvertently changes the orientation of the head, giving a less favourable result. Our discrete inversion method edits the image while preserving the other properties of the object being edited, by injecting information from the input image into the logit space. The dotted red box indicates the masked region.
  • Figure 2: The two reconstruction and editing paradigms: ODE-based and non-ODE-based. (a,b) ODE-based reconstruction and editing is accurate but depends heavily on the underlying ODE trajectory, which is not available in discrete diffusion. (c,d) Non-ODE editing instead samples a trajectory by directly adding noise to $x_0$ and records the difference between the predicted $x_{t-1}$ and the sampled $x_{t-1}$, as indicated by the red arrow. This makes it possible to reconstruct or edit the image without the strong condition of an underlying ODE. (e,f) The inversion and editing process for masked generative modeling (MGM), as in Algorithm \ref{alg:1}.
  • Figure 3: Visualization of editing results. Editing results of our method with Paella and VQ-Diffusion are shown along with their corresponding prompts. The results demonstrate that our method can effectively modify the input image according to the target prompt while preserving the image structure. Editing with a masked generative model (Paella \cite{rampas2022novel}) is more stable and easier than with multinomial diffusion models (VQ-Diffusion \cite{gu2022vector}).
  • Figure 4: CVPR Situation
  • Figure 5: Mutual information between $\boldsymbol{z}_t$ and $\boldsymbol{x}_0$. Computed with a simple DDPM setting by assuming $\boldsymbol{x}_0\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$.
  • ...and 12 more figures
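
The non-ODE inversion described in the Figure 2 caption (sample a noisy trajectory from $x_0$, then record the gap between the predicted and the sampled $x_{t-1}$) can be illustrated with a toy sketch. This is not the paper's algorithm: `toy_denoiser` is a hypothetical stand-in for a trained model, and the corruption schedule, the one-hot logit `scale`, and the `strength` parameter are assumptions made for illustration. The sketch also leaves out the noise-injection controls ($\tau$, $\lambda_1$, $\lambda_2$).

```python
import random

def one_hot_logits(tokens, vocab_size, scale=10.0):
    """Encode integer tokens as peaked logits (scale is an arbitrary choice)."""
    return [[scale if v == tok else 0.0 for v in range(vocab_size)] for tok in tokens]

def toy_denoiser(x_t, t, vocab_size):
    """Hypothetical denoiser: deterministic pseudo-random logits for (x_t, t)."""
    rng = random.Random(1000 * t + sum(x_t))
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in x_t]

def invert(x0, num_steps, vocab_size, rng):
    """Sample a noisy trajectory from x0 and record per-step logit residuals."""
    # Each step is noised independently from x0; larger t resamples more tokens.
    traj = [list(x0)]
    for t in range(1, num_steps + 1):
        traj.append([rng.randrange(vocab_size) if rng.random() < t / num_steps else tok
                     for tok in x0])
    residuals = {}
    for t in range(num_steps, 0, -1):
        pred = toy_denoiser(traj[t], t, vocab_size)
        target = one_hot_logits(traj[t - 1], vocab_size)
        # Residual = (logits of the sampled x_{t-1}) - (logits the model predicts).
        residuals[t] = [[a - p for a, p in zip(tr, pr)] for tr, pr in zip(target, pred)]
    return traj[num_steps], residuals

def reconstruct(x_T, residuals, num_steps, vocab_size, strength=1.0):
    """Run the reverse process, adding the recorded residuals back in logit space."""
    x_t = list(x_T)
    for t in range(num_steps, 0, -1):
        pred = toy_denoiser(x_t, t, vocab_size)
        logits = [[p + strength * z for p, z in zip(pr, zr)]
                  for pr, zr in zip(pred, residuals[t])]
        x_t = [max(range(vocab_size), key=row.__getitem__) for row in logits]
    return x_t

rng = random.Random(0)
x0 = [rng.randrange(16) for _ in range(8)]
x_T, residuals = invert(x0, num_steps=5, vocab_size=16, rng=rng)
x_rec = reconstruct(x_T, residuals, num_steps=5, vocab_size=16)
```

With `strength=1.0` and the same denoiser, the prediction cancels at every step and the reverse pass retraces the sampled trajectory, so `x_rec` equals `x0`. In this toy setting, editing would correspond to calling the denoiser under a different condition in `reconstruct` while reusing the recorded residuals, possibly with `strength` below 1.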

Theorems & Definitions (2)

  • Remark C.1
  • proof