MaskInversion: Localized Embeddings via Optimization of Explainability Maps

Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne

TL;DR

This paper tackles the challenge of obtaining region-specific representations from pretrained vision-language models like CLIP without fine-tuning. It introduces MaskInversion, which learns a localized embedding token LET_m by optimizing an explainability map to match a user-provided mask while keeping the backbone frozen; a Dice-based objective and an optional global-context regularizer balance local focus with overall image information. A gradient-decomposition technique accelerates inference when handling multiple masks on the same image, and LeGrad provides effective explainability guidance for ViT-based backbones. The resulting region embeddings improve zero-shot local classification, referring-expression retrieval, localized captioning, and region-aware diffusion, outperforming several training-free and some SOTA methods on standard benchmarks such as VOC, PascalContext, COCO, PhraseCut, RefCOCO, and RefCOCO+. This approach enables robust, region-focused understanding and generation with minimal model modification, broadening practical applicability of foundation models in localized vision-language tasks.

Abstract

Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and compares its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, localized captioning, and image generation. We evaluate the proposed method on all of these tasks on several datasets, such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7, and show its capabilities compared to other SOTA approaches.
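
The optimization loop described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the "explainability map" here is a toy stand-in (a sigmoid over the dot product between each frozen patch feature and the token) rather than LeGrad, the token is initialized from the mean patch feature rather than a real [CLS] token, and all names (`toy_map`, `dice_loss`, `mask_inversion`), the learning rate, and the example features are assumptions made for illustration. Only the overall idea comes from the text: the features stay frozen, and a single embedding token is updated by gradient descent so that a Dice-style loss between its map and the query mask decreases.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_map(token, patches):
    """Toy stand-in for an explainability map: one score in (0, 1) per patch."""
    return [sigmoid(sum(p_d * t_d for p_d, t_d in zip(p, token))) for p in patches]


def dice_loss(emap, mask, eps=1e-6):
    """Soft Dice loss between a map and a binary mask (0 means perfect overlap)."""
    num = 2.0 * sum(e * m for e, m in zip(emap, mask)) + eps
    den = sum(emap) + sum(mask) + eps
    return 1.0 - num / den


def mask_inversion(patches, mask, steps=10, lr=1.0, eps=1e-6):
    """Optimize only the token; the patch features are never modified."""
    dim = len(patches[0])
    # Assumption: mean patch feature plays the role of the [CLS] initialization.
    token = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    losses = []
    for _ in range(steps):
        emap = toy_map(token, patches)
        losses.append(dice_loss(emap, mask, eps))
        num = 2.0 * sum(e * m for e, m in zip(emap, mask)) + eps
        den = sum(emap) + sum(mask) + eps
        grad = [0.0] * dim
        for p, e, m in zip(patches, emap, mask):
            # Chain rule: d(loss)/d(map_i), then through the sigmoid, then the dot product.
            dloss_de = -(2.0 * m * den - num) / (den * den)
            dloss_dlogit = dloss_de * e * (1.0 - e)
            for d in range(dim):
                grad[d] += dloss_dlogit * p[d]
        token = [t - lr * g for t, g in zip(token, grad)]
    losses.append(dice_loss(toy_map(token, patches), mask, eps))
    return token, losses


# Hypothetical frozen features for four patches; the mask selects the first two.
patches = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2], [0.0, 1.0, 0.2], [0.1, 0.9, 0.2]]
mask = [1.0, 1.0, 0.0, 0.0]

token, losses = mask_inversion(patches, mask)
final_map = toy_map(token, patches)
print(f"Dice loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("final map:", [round(v, 3) for v in final_map])
```

In this toy setting the map starts out uniform across patches and, after the updates, scores the masked patches higher than the unmasked ones. In the actual method the map comes from an explainability technique applied to a frozen foundation model, so the gradient computation is considerably more involved than the closed-form chain rule used here.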

Paper Structure

This paper contains 34 sections, 11 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: MaskInversion Applications: The proposed MaskInversion method generates a localized embedding without modifying the vision encoder, thereby enabling seamless integration as a drop-in replacement for the vision encoder output across various scenarios, such as Localized Classification to classify a specific region of an image, Localized Captioning to direct the attention of an LLM to specific parts of an image, or Localized Diffusion where the embedding is used in conjunction with a diffusion model to generate variations of specific regions of images.
  • Figure 2: Overview of the proposed method: Step 0: the input image is forwarded only once during the whole MaskInversion process. Step 1: the localized embedding token $LET_\textbf{m}$ is initialized with the vision encoder's [CLS] token. $LET_\textbf{m}$ is then trained such that its explainability map correlates with the query mask. Step K: after $K$ gradient descent steps, we obtain the final localized embedding $LET_\textbf{m}$ that can be used for downstream tasks.
  • Figure 3: Localized Embedding Visualizations: Visualization of the learned localized embedding using (left) a pretrained diffusion model; (right) an image captioner. In both cases, the global feature representation is replaced by the output of MaskInversion for the given query mask.
  • Figure 4: Visualization of the Explainability Maps throughout the optimization steps.
  • Figure 5: Convergence Analysis of MaskInversion. The plot illustrates the optimization loss (red, left axis) and the resulting accuracy on PascalVOC (blue, right axis) over iterations. The dotted line marks the chosen stopping point at $K=10$ iterations.
  • ...and 7 more figures