Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

Zixuan Hu, Yongxian Wei, Li Shen, Zhenyi Wang, Lei Li, Chun Yuan, Dacheng Tao

TL;DR

This work tackles the inefficiency of dense model inversion on high-resolution Vision Transformers by identifying redundant background inversion and hallucinated spurious correlations. It introduces Sparse Model Inversion (SMI), which uses CLS-attention-based semantic patch identification and progressive early stopping to invert only informative foreground patches, without altering the original inversion losses. Theoretical analysis shows reduced sample and iteration requirements, and empirical results demonstrate up to $3.79\times$ speedups with comparable or improved performance on data-free model quantization and data-free knowledge transfer across ViT variants. The approach enables scalable, data-free data synthesis from ViTs with practical impact on model compression and knowledge transfer while mitigating leakage of spurious cues.

Abstract

Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations, a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to $3.79\times$ faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
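
The selection step described above (rank patches by how strongly the CLS token attends to them, then progressively stop the weakest ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_patches` and `progressive_stopping`, the `keep_ratio` parameter, and the fixed attention scores are all hypothetical, chosen only to show the idea. In a real inversion loop the attention scores would be read from the ViT at each stage.

```python
from typing import List


def select_patches(cls_attention: List[float], active: List[int], keep_ratio: float) -> List[int]:
    """Keep the most-attended fraction of the currently active patches.

    cls_attention[i] is an assumed CLS-to-patch attention score for patch i.
    """
    n_keep = max(1, int(len(active) * keep_ratio))
    # Rank active patch indices by attention score, highest first.
    ranked = sorted(active, key=lambda i: cls_attention[i], reverse=True)
    # Return the survivors in their original spatial order.
    return sorted(ranked[:n_keep])


def progressive_stopping(attention_per_stage: List[List[float]], num_patches: int,
                         keep_ratio: float) -> List[List[int]]:
    """Apply the selection once per stage; stopped patches never come back."""
    active = list(range(num_patches))
    history = [active]
    for cls_attention in attention_per_stage:
        active = select_patches(cls_attention, active, keep_ratio)
        history.append(active)
    return history


# Hypothetical scores for 8 patches, reused for two stages for simplicity.
scores = [0.05, 0.30, 0.02, 0.25, 0.01, 0.20, 0.07, 0.10]
history = progressive_stopping([scores, scores], num_patches=8, keep_ratio=0.5)
print(history[1])  # patches still inverted after stage 1
print(history[2])  # patches still inverted after stage 2
```

With these made-up scores, half of the patches are stopped at each stage, so only the patches with the highest CLS attention continue to receive updates. The inversion loss itself is untouched, which matches the plug-and-play framing in the abstract.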

Paper Structure

This paper contains 23 sections, 1 theorem, 8 equations, 9 figures, 8 tables.

Key Result

Lemma 4.1

[Li00C23] Under certain assumptions, a ViT with initial errors $\sigma$ and $\delta$ for value and query/key vectors respectively, and trained via SGD with step size $\eta$, can achieve zero generalization error (i.e., population risk achieves zero) with a probability of at least 0.99. This outcome is where $c^{\prime}, c^{\prime \prime}>0$ are constants, and $\zeta \gtrsim 1-\eta^{10}$.

Figures (9)

  • Figure 1: The inefficiency of dense inversion (e.g., DeepInversion [yin2020dreaming]) arises from (a) redundant inversion of noisy backgrounds, and (b) unintended inversion of spurious correlations between foregrounds (green) and backgrounds (red), which are improperly memorized in pre-trained models.
  • Figure 2: Recipe for model inversion and applications. Our approach selectively inverts semantic foreground patches while progressively stopping the inversion of uninformative background ones. When utilizing sparsely inverted data for downstream applications, we feed forward only the retained foreground patches.
  • Figure 3: Overall process of sparse model inversion. As the inversion progresses, our approach selectively inverts semantic foreground patches while progressively stopping the inversion of uninformative background patches (marked as black blocks). Those stopped patches are directly discarded, with no further feed-forward processing and backward gradient computation, and thus are excluded from inversion ever since. The final inverted image only retains sparse patches with semantically meaningful information.
  • Figure 4: Impact of utilizing sparsely (blue curve) versus densely (orange curve) inverted data on training loss (left) and validation accuracy (right) throughout the knowledge transfer process.
  • Figure 5: Our inverted images of $224\times 224$ pixels from ViT/32-Base encompass a wide range of datasets, from natural images (CIFAR100 and MiniImageNet) to more specialized categories (Oxford 102 Flower for various flower species and CUB-200-2011 for bird species).
  • ...and 4 more figures
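
As a rough illustration of why discarding stopped patches (Figure 3) saves computation, the sketch below counts tokens for a $224\times 224$ image split into $32\times 32$ patches and compares the number of pairwise attention interactions before and after pruning. The helper names and the 50% kept fraction are assumptions for illustration only. The count covers just the quadratic attention term, so the ratio is not the paper's measured speedup.

```python
def num_tokens(image_size: int, patch_size: int, kept_fraction: float = 1.0) -> int:
    """Number of tokens fed to the ViT: kept patches plus one CLS token."""
    patches_per_side = image_size // patch_size
    total_patches = patches_per_side * patches_per_side
    kept = max(1, int(total_patches * kept_fraction))
    return kept + 1  # +1 for the CLS token


def attention_pairs(tokens: int) -> int:
    """Pairwise query-key interactions in one self-attention head."""
    return tokens * tokens


dense = num_tokens(224, 32)                       # all 49 patches + CLS
sparse = num_tokens(224, 32, kept_fraction=0.5)   # assumed: half the patches kept
print(dense, sparse)
print(attention_pairs(dense) / attention_pairs(sparse))
```

Because self-attention compares every token with every other token, halving the token count cuts this term by roughly a factor of four, for both the forward pass and the backward gradient computation.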

Theorems & Definitions (3)

  • Remark 4.1
  • Remark 4.2
  • Lemma 4.1