Learning Where to Edit Vision Transformers

Yunqiao Yang, Long-Kai Huang, Shengzhuang Chen, Kede Ma, Ying Wei

TL;DR

This paper takes initial steps towards correcting predictive errors of ViTs, particularly those arising from subpopulation shifts, by meta-learning a hypernetwork on CutMix-augmented data generated for editing reliability and fine-tuning the identified parameters using a variant of gradient descent.

Abstract

Model editing aims to data-efficiently correct predictive errors of large pre-trained models while ensuring generalization to neighboring failures and locality to minimize unintended effects on unrelated examples. While significant progress has been made in editing Transformer-based large language models, effective strategies for editing vision Transformers (ViTs) in computer vision remain largely untapped. In this paper, we take initial steps towards correcting predictive errors of ViTs, particularly those arising from subpopulation shifts. Taking a locate-then-edit approach, we first address the where-to-edit challenge by meta-learning a hypernetwork on CutMix-augmented data generated for editing reliability. This trained hypernetwork produces generalizable binary masks that identify a sparse subset of structured model parameters, responsive to real-world failure samples. Afterward, we solve the how-to-edit problem by simply fine-tuning the identified parameters using a variant of gradient descent to achieve successful edits. To validate our method, we construct an editing benchmark that introduces subpopulation shifts towards natural underrepresented images and AI-generated images, thereby revealing the limitations of pre-trained ViTs for object recognition. Our approach not only achieves superior performance on the proposed benchmark but also allows for adjustable trade-offs between generalization and locality. Our code is available at https://github.com/hustyyq/Where-to-Edit.
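The how-to-edit step described above, fine-tuning only the parameters selected by a binary mask, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the model is a toy linear regressor rather than a ViT, the mask is hand-picked rather than produced by the hypernetwork, and the update is plain gradient descent rather than the unspecified variant the abstract mentions. The function names `masked_gd_step`, `squared_error_grad`, and `edit` are hypothetical and do not come from the authors' code.

```python
def masked_gd_step(params, grads, mask, lr=0.1):
    """One gradient-descent step that only moves parameters where mask == 1."""
    return [p - lr * m * g for p, g, m in zip(params, grads, mask)]


def squared_error_grad(params, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w for a linear model."""
    pred = sum(w * xi for w, xi in zip(params, x))
    return [(pred - y) * xi for xi in x]


def edit(params, mask, x, y, lr=0.1, steps=200):
    """Fit a single 'failure sample' (x, y) by updating only masked parameters."""
    for _ in range(steps):
        grads = squared_error_grad(params, x, y)
        params = masked_gd_step(params, grads, mask, lr)
    return params


# Toy setup (assumed values): four parameters, of which only two may change.
params = [0.5, -1.0, 2.0, 0.0]
mask = [0, 1, 1, 0]          # hand-picked stand-in for a hypernetwork's binary mask
x = [1.0, 2.0, -1.0, 3.0]    # the sample to be corrected
y = 1.0                      # its desired output

edited = edit(params, mask, x, y)
prediction = sum(w * xi for w, xi in zip(edited, x))
```

After the loop, `prediction` matches `y` up to numerical tolerance while the masked-out entries of `edited` are identical to their values in `params`. This is the sense in which a sparse mask can confine an edit to a small part of the model.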

Paper Structure

This paper contains 51 sections, 12 equations, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 1: System diagram of the proposed model editing method.
  • Figure 2: The left subfigure shows representative editing examples in which the base ViT misclassifies a volleyball as a basketball. The right subfigure depicts the trade-off between generalization and locality when editing different groups of FFNs or MSAs in the base ViT. Editing the $8$-th to $10$-th FFNs yields the best Pareto front.
  • Figure 3: Visual examples seen by the base ViT/B-16 during pre-training, contrasted with visual examples in the proposed editing benchmark as predictive errors of the base ViT/B-16.
  • Figure 4: Editing results for ViT/B-16 on the proposed benchmark.
  • Figure 5: Ablation results of the hypernetwork for ViT/B-16.
  • ...and 15 more figures
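
The Figure 2 caption compares editing choices by their Pareto front over generalization and locality. As a rough sketch of what that comparison involves, the snippet below keeps only the candidates that no other candidate matches or beats on both scores. The function name `pareto_front`, the group names, and all scores are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (name, generalization, locality); higher is better on both.
    A point is dominated if another point is at least as good on both scores
    and strictly better on at least one.
    """
    front = []
    for name, gen, loc in points:
        dominated = any(
            (g2 >= gen and l2 >= loc) and (g2 > gen or l2 > loc)
            for _, g2, l2 in points
        )
        if not dominated:
            front.append((name, gen, loc))
    return front


# Placeholder scores (assumed values, NOT taken from the paper).
candidates = [
    ("group_a", 0.60, 0.95),
    ("group_b", 0.80, 0.90),
    ("group_c", 0.75, 0.85),
    ("group_d", 0.90, 0.70),
]

front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
```

With these placeholder values, `group_c` is dropped because `group_b` is better on both scores, and the remaining three candidates each represent a different trade-off between the two criteria.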