DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models
Xiaoxiao He, Quan Dao, Ligong Han, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, Bo Liu, Kang Li, Hongdong Li, Junzhou Huang, Faez Ahmed, Akash Srivastava, Dimitris Metaxas
TL;DR
DICE is an inversion framework for discrete diffusion models, covering both multinomial diffusion and masked generative models. It records residuals along the reverse path, which enables precise reconstruction and controllable editing without predefined masks or attention manipulation. The recorded latent residuals are combined with tunable noise-injection hyperparameters (tau, lambda1, lambda2) that trade off reconstruction quality against edit strength. The method is evaluated on Paella, VQ-Diffusion, RoBERTa, and LLaDA with quantitative and qualitative results in both image and text domains, including turning RoBERTa into a generative editor. This extends discrete diffusion models to fine-grained, controlled content editing in vision and language tasks.
Abstract
Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
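The idea of recording noise during the reverse process so that a sample can be reproduced exactly can be illustrated with a toy sketch. This is not the DICE algorithm: it covers a single categorical step in a Gumbel-max style. The function names (`record_residual`, `decode`), the `peak` constant, and the way the latent is built are all assumptions made for this example. The residual stores the gap between a noisy latent of the observed token and the model's logits, so adding it back to the same logits recovers the token, while different logits (for example, from an edited prompt) can yield a different token.

```python
import math
import random

def gumbel(rng):
    # Standard Gumbel(0, 1) sample via the inverse CDF; u is clamped to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def record_residual(token, logits, rng, peak=40.0):
    # Hypothetical construction: a noisy latent that is strongly peaked at the
    # observed token. `peak` is an illustrative constant, not a DICE parameter.
    latent = [(peak if k == token else 0.0) + gumbel(rng)
              for k in range(len(logits))]
    # The residual is whatever must be added to the logits to obtain the latent.
    return [y - l for y, l in zip(latent, logits)]

def decode(logits, residual):
    # Re-inject the recorded residual and take the argmax.
    scores = [l + z for l, z in zip(logits, residual)]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
source_logits = [0.5, -1.0, 2.0, 0.0]   # toy model output for the source condition
observed = 1                             # token actually present in the data

z = record_residual(observed, source_logits, rng)

# Same logits plus recorded residual: the observed token is recovered.
reconstructed = decode(source_logits, z)

# Logits from a different condition that strongly favour token 3: the result changes.
edited_logits = [0.5, -1.0, 2.0, 100.0]
edited = decode(edited_logits, z)
```

In this sketch, `reconstructed` equals `observed` because the logits cancel and only the peaked latent remains. Whether `edited` differs depends on how far the new logits move relative to `peak`, which loosely mirrors the trade-off between reconstruction fidelity and edit strength described above.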
