ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

Dana Arad, Hadas Orgad, Yonatan Belinkov

TL;DR

ReFACT addresses the challenge of outdated factual knowledge in text-to-image diffusion models by editing the text encoder rather than retraining. It inserts a learned vector into a single MLP layer to establish a new key–value mapping, guided by a contrastive objective that aligns the edit prompt with the target and separates it from negatives, and uses a closed-form rank-one update to realize the change. Across TIME and RoAD datasets, ReFACT achieves superior efficacy, generalization, and specificity while preserving image generation quality, outperforming cross-attention edits and personalization baselines. This approach enables practical, scalable updates to deployed text-to-image systems, reducing maintenance costs and enabling timely corrections to factual information without user-facing prompt engineering.
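The closed-form rank-one update mentioned above can be illustrated with a small sketch. The source states only that the update is closed-form and rank-one. The specific formula below, which uses a key covariance matrix `C` in the style of ROME-like model editing, is an assumption, as are all function names and the toy numbers. It is dependency-free Python meant to show the idea, not the authors' implementation.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_one_update(W, k_star, v_star, C):
    """Return W' = W + lam * u^T with u = C^{-1} k*, so that W' k* = v*.

    Assumed ROME-style form; the source does not spell out the formula.
    """
    u = solve(C, k_star)                      # u = C^{-1} k*
    Wk = matvec(W, k_star)                    # what the layer currently outputs
    scale = dot(u, k_star)
    lam = [(v - wk) / scale for v, wk in zip(v_star, Wk)]
    # Adding the outer product lam * u^T changes W by a rank-one matrix.
    return [[w + l * uj for w, uj in zip(row, u)] for row, l in zip(W, lam)]


# Toy example (hypothetical numbers): a 2x3 layer, one new key-value pair.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
C = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.5]]
k_star = [1.0, 2.0, 0.5]
v_star = [3.0, -1.0]
W_new = rank_one_update(W, k_star, v_star, C)
```

After the update, `matvec(W_new, k_star)` reproduces `v_star` up to floating-point error, while any key `k` with `dot(solve(C, k_star), k) == 0` is mapped exactly as before. This is one way a single-layer edit can leave most of the model's behavior untouched.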

Abstract

Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-image models without relying on explicit input from end-users or costly re-training. ReFACT updates the weights of a specific layer in the text encoder, modifying only a tiny portion of the model's parameters and leaving the rest of the model unaffected. We empirically evaluate ReFACT on an existing benchmark, alongside a newly curated dataset. Compared to other methods, ReFACT achieves superior performance in both generalization to related concepts and preservation of unrelated concepts. Furthermore, ReFACT maintains image generation quality, making it a practical tool for updating and correcting factual information in text-to-image models.

Paper Structure (48 sections, 4 equations, 28 figures, 3 tables)

Figures (28)

  • Figure 1: ReFACT edits knowledge in text-to-image models using an editing prompt and a target prompt (e.g., "The President of the United States" is edited to "Joe Biden"). The edit generalizes to prompts unseen during editing.
  • Figure 2: (A) An overview of a diffusion text-to-image model after editing with ReFACT. The edited text encoder generates textual representations reflecting the updated information. Then, the representations are fed into the cross-attention mechanism of a diffusion model, generating an image reflecting the new fact. (B) ReFACT receives an edit prompt and a target prompt representing the desired change. We obtain the representations of the target and of the other contrastive examples by passing them through the frozen CLIP text encoder and taking the output at the [EOS] token. Then, we optimize a vector $v^*$ that, when inserted in a specific layer, will reduce the distance between the representations of the edit and target prompts, and increase the distance with respect to the contrastive examples. The vector $v^*$ is then planted in the MLP layer using a closed-form solution.
  • Figure 3: Samples from the two datasets, TIME and RoAD. The TIME dataset contains edits of implicit model assumptions, while RoAD targets the general visual appearance of the edited subject. Each entry of RoAD contains five positive prompts and five negative prompts, used for evaluation.
  • Figure 4: Specificity of ReFACT. Our method is able to precisely edit specific concepts without affecting related concepts or other elements in the generated image.
  • Figure 5: ReFACT is able to generalize to related prompts.
  • ...and 23 more figures
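
The contrastive objective described in the Figure 2 (B) caption can be sketched as follows. The caption says only that the optimized vector should reduce the distance between the edit and target representations and increase the distance to the contrastive examples. The softmax-over-negative-distances form, the squared Euclidean distance, and the names `sq_dist` and `contrastive_loss` are illustrative assumptions, and the toy vectors stand in for [EOS] representations from the text encoder.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_loss(edit_repr, target_repr, negative_reprs):
    """Low when edit_repr is near the target and far from every negative.

    Assumed InfoNCE-style form: -log of the target's share of the total
    similarity, where similarity = exp(-squared distance).
    """
    pos = math.exp(-sq_dist(edit_repr, target_repr))
    neg = sum(math.exp(-sq_dist(edit_repr, n)) for n in negative_reprs)
    return -math.log(pos / (pos + neg))


# Hypothetical 2-D stand-ins for text-encoder outputs.
target = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_near = contrastive_loss([0.9, 0.1], target, negatives)  # close to target
loss_far = contrastive_loss([0.1, 0.9], target, negatives)   # close to a negative
```

In this sketch `loss_near` is smaller than `loss_far`, so minimizing the loss over the inserted vector pulls the edit prompt's representation toward the target and away from the contrastive examples.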