Uncovering and Mitigating Transient Blindness in Multimodal Model Editing

Xiaoqi Han, Ru Li, Ran Yi, Hongye Tan, Zhuomin Liang, Víctor Gutiérrez-Basulto, Jeff Z. Pan

TL;DR

We address transient blindness in multimodal model editing by introducing De-VQA, a plug-and-play dynamic evaluation framework that quantifies locality along three dimensions, Random-Image Locality (RI-Loc), No-Image Locality (NI-Loc), and Consistent-Image Locality (CI-Loc), using seven data types. De-VQA reveals that existing MMED methods overfit to edit-related text and underutilize visual information, a deficiency we term transient blindness. We propose a locality-aware adversarial loss that enables balanced cross-modal updates, and demonstrate a 17% average gain in locality across two representative multimodal editors on multiple datasets while maintaining edit accuracy. Together, De-VQA provides a benchmark for cross-modal locality and the adversarial loss provides a practical mitigation strategy for stabilizing multimodal knowledge updates.

Abstract

Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, which obscures overfitting. We propose a comprehensive locality evaluation framework covering three key dimensions: random-image locality, no-image locality, and consistent-image locality. These are operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, which uncovers a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visuals. Token analysis shows that edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
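To make the notion of per-dimension locality concrete, the following is a minimal sketch, not the paper's implementation. It assumes locality is scored in the way common in model editing work: as the fraction of out-of-scope inputs on which the edited model gives the same answer as the original model. The function name `locality_by_type`, the record layout, and the toy answers are hypothetical; the exact metric used by De-VQA may differ.

```python
from collections import defaultdict

def locality_by_type(records):
    """Compute a per-type locality score.

    records: iterable of (loc_type, pre_edit_answer, post_edit_answer).
    Assumption: locality = share of inputs whose answer is unchanged
    by the edit. This is an illustrative definition, not the paper's.
    """
    unchanged = defaultdict(int)
    totals = defaultdict(int)
    for loc_type, before, after in records:
        totals[loc_type] += 1
        if before == after:
            unchanged[loc_type] += 1
    return {t: unchanged[t] / totals[t] for t in totals}

# Toy, made-up model outputs grouped by the three locality dimensions.
records = [
    ("RI-Loc", "cat", "cat"),
    ("RI-Loc", "dog", "bird"),
    ("NI-Loc", "red", "red"),
    ("CI-Loc", "two", "three"),
    ("CI-Loc", "two", "two"),
    ("CI-Loc", "bus", "car"),
    ("CI-Loc", "bus", "bus"),
]
scores = locality_by_type(records)
for loc_type, score in scores.items():
    print(f"{loc_type}: {score:.2f}")
```

Reporting the three dimensions separately, rather than as a single average, is what lets a low score on one dimension stand out even when another remains high.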

Paper Structure

This paper contains 34 sections, 16 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Current locality evaluation focuses only on low-similarity data (a), while the edited model fails on high-similarity cases (b).
  • Figure 2: Overview of De-VQA: dynamic sampling of related and unrelated image-text pairs $(T_i,I_j)$, $i,j\in\{2,3,4\}$, for the edited pair $(T_1,I_1)$. Consistent-Image indicates that either the image or the textual input is related to the edited data. Random-Image represents cases of image-text mismatch. No-Image denotes text-only inputs without any accompanying image.
  • Figure 3: Causal information flow in multimodal models. The red paths highlight the causal trace originating from image tokens. After editing a high layer (gray area), the causal influence from image tokens is blocked (indicated by the black paths), while the flow from text tokens remains unaffected.
  • Figure 4: Main experiment results on Blip2OPT and MiniGPT4. The lower performance of existing editing methods on {CI ($T_2I_2$), RI ($T_1I_3$), NI ($T_1I_4$)}-Loc compared to the better performance on {T,I}-Loc reflects the inadequacies of the original locality evaluation. Our method (black node) can achieve more comprehensive performance in terms of locality.
  • Figure 5: Locality metric performance comparison on MiniGPT4. Different colors denote metric types.
  • ...and 4 more figures