ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
Dana Arad, Hadas Orgad, Yonatan Belinkov
TL;DR
ReFACT addresses the challenge of outdated factual knowledge in text-to-image diffusion models by editing the text encoder rather than retraining. It inserts a learned vector into a single MLP layer to establish a new key–value mapping, guided by a contrastive objective that aligns the edit prompt with the target and separates it from negatives, and uses a closed-form rank-one update to realize the change. Across TIME and RoAD datasets, ReFACT achieves superior efficacy, generalization, and specificity while preserving image generation quality, outperforming cross-attention edits and personalization baselines. This approach enables practical, scalable updates to deployed text-to-image systems, reducing maintenance costs and enabling timely corrections to factual information without user-facing prompt engineering.
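The two ingredients named above, a contrastive objective and a closed-form rank-one update, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not give the exact loss or update formula, so the sketch assumes an InfoNCE-style loss over cosine similarities and a ROME-style update with a key covariance matrix `C` (identity by default). The names `contrastive_loss`, `rank_one_update`, `k_star`, `v_star`, and `temperature` are hypothetical. In practice the vector `v_star` would be found by gradient descent on such a loss; that optimization loop is omitted here.

```python
import numpy as np


def contrastive_loss(edit_emb, target_emb, negative_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Low when the edit-prompt embedding is close to the target embedding
    and far from every negative embedding (rows of `negative_embs`).
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    e, t, n = unit(edit_emb), unit(target_emb), unit(negative_embs)
    pos = (e @ t) / temperature          # similarity to the target
    neg = (n @ e) / temperature          # similarities to the negatives
    logits = np.concatenate([[pos], neg])
    m = logits.max()                     # shift for a stable log-sum-exp
    log_norm = m + np.log(np.exp(logits - m).sum())
    return -(pos - log_norm)


def rank_one_update(W, k_star, v_star, C=None):
    """Return W' = W + (v* - W k*) u^T / (u^T k*), with u = C^{-1} k*.

    Assumed ROME-style closed form: W' maps the key k* exactly to v*,
    and W' - W has rank one. `C` stands in for a key covariance matrix;
    it defaults to the identity, which is an assumption of this sketch.
    """
    if C is None:
        C = np.eye(W.shape[1])
    u = np.linalg.solve(C, k_star)       # u = C^{-1} k*
    residual = v_star - W @ k_star       # what the layer currently gets wrong
    return W + np.outer(residual, u) / (u @ k_star)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))              # toy stand-in for one MLP weight matrix
k_star = np.array([1.0, 0.0, 0.0])       # key for the edited association
v_star = rng.normal(size=4)              # desired new value for that key
W_new = rank_one_update(W, k_star, v_star)
```

With `C` set to the identity, any key orthogonal to `k_star` produces the same output before and after the edit, which is the toy analogue of leaving unrelated concepts untouched.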
Abstract
Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-image models without relying on explicit input from end-users or costly re-training. ReFACT updates the weights of a specific layer in the text encoder, modifying only a tiny portion of the model's parameters and leaving the rest of the model unaffected. We empirically evaluate ReFACT on an existing benchmark, alongside a newly curated dataset. Compared to other methods, ReFACT achieves superior performance in both generalization to related concepts and preservation of unrelated concepts. Furthermore, ReFACT maintains image generation quality, making it a practical tool for updating and correcting factual information in text-to-image models.
