Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models

Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E Turner

TL;DR

The paper addresses how vision-language models, typically built with frozen CLIP-style vision encoders, struggle under distribution shifts. It introduces LoRSU, a parameter-efficient, locality-preserving method that updates a targeted subset of vision-transformer parameters by ranking head importance via gradient norms and applying LoRA-style low-rank updates, plus selective masking of the first MLP layer. The approach is theoretically justified and evaluated across multiple CLIP backbones and VLMs in offline and continual few-shot settings, showing near-full-finetune performance on target tasks with minimal forgetting on other data. The work demonstrates practical, scalable improvements to VLM robustness and continual adaptation, with strong VQA gains and broad applicability to transformer-based encoders. Overall, LoRSU offers a principled framework for efficient, localized vision-encoder updates that enhance vision-language alignment under challenging, unseen conditions.
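The two mechanisms named in the summary, ranking attention heads by gradient norm and applying a LoRA-style low-rank update, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names `rank_heads_by_grad_norm` and `lora_weight`, the `(num_heads, rows, cols)` gradient layout, the use of the Frobenius norm as the importance score, and the toy shapes.

```python
import numpy as np

def rank_heads_by_grad_norm(head_grads, k):
    """Return the indices of the k heads with the largest gradient norm.

    head_grads: array of shape (num_heads, rows, cols), one gradient block
    per attention head (a hypothetical layout chosen for this sketch).
    """
    head_grads = np.asarray(head_grads, dtype=float)
    num_heads = head_grads.shape[0]
    # Frobenius norm of each head's gradient block (assumed importance score).
    scores = np.linalg.norm(head_grads.reshape(num_heads, -1), axis=1)
    # Sort heads from most to least important and keep the first k.
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores

def lora_weight(W, B, A):
    """Standard LoRA form: frozen weight W plus a low-rank update B @ A.

    B has shape (d_out, r) and A has shape (r, d_in), so B @ A has rank <= r.
    """
    return W + B @ A

# Toy gradients for three heads with norms 0, 4 and 2.
grads = np.stack([np.zeros((2, 2)), 2 * np.ones((2, 2)), np.ones((2, 2))])
top, scores = rank_heads_by_grad_norm(grads, k=2)

# With B initialised to zero, the adapted weight equals the frozen weight.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
A = rng.normal(size=(1, 4))
B_zero = np.zeros((4, 1))
W_init = lora_weight(W, B_zero, A)

# A non-zero B yields an update whose rank is bounded by r = 1.
B = rng.normal(size=(4, 1))
delta = lora_weight(W, B, A) - W
```

In this sketch only the heads returned in `top` would receive a low-rank update; how the paper combines the selection with the adapters and with the MLP masking is not specified in this summary.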

Abstract

Vision language models (VLMs) demonstrate impressive capabilities in visual question answering and image captioning, acting as a crucial link between visual and language models. However, existing open-source VLMs heavily rely on pretrained and frozen vision encoders (such as CLIP). Despite CLIP's robustness across diverse domains, it still exhibits non-negligible image understanding errors. These errors propagate to the VLM responses, resulting in sub-optimal performance. In our work, we propose an efficient and robust method for updating vision encoders within VLMs. Our approach selectively and locally updates encoders, leading to substantial performance improvements on data where previous mistakes occurred, while maintaining overall robustness. Furthermore, we demonstrate the effectiveness of our method during continual few-shot updates. Theoretical grounding, generality, and computational efficiency characterize our approach.

Paper Structure (19 sections, 2 theorems, 9 equations, 12 figures, 17 tables)

This paper contains 19 sections, 2 theorems, 9 equations, 12 figures, and 17 tables.

Key Result

Lemma 3.2

For any $\mathbf{x} \in \mathbb{R}^d \setminus \{\mathbf{0}\}$ and $1 \leq S \leq d$, the optimal mask is zero everywhere except at the positions of the $S$ largest-magnitude elements of $\mathbf{x}$.
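As a small illustration of the lemma, the sketch below builds a binary mask that keeps the $S$ largest-magnitude entries of a vector. The function name `top_s_mask` is hypothetical, and the brute-force check assumes the optimality criterion is to retain as much of $\|\mathbf{x}\|^2$ as possible under a budget of $S$ ones; Definition 3.1 is not reproduced on this page, so that criterion is an assumption.

```python
import itertools
import numpy as np

def top_s_mask(x, S):
    """Binary mask with ones at the S largest-magnitude entries of x."""
    x = np.asarray(x, dtype=float).ravel()
    d = x.size
    if not 1 <= S <= d:
        raise ValueError("S must satisfy 1 <= S <= d")
    mask = np.zeros(d, dtype=int)
    # argpartition puts the indices of the S largest |x_i| in the last S slots.
    top_idx = np.argpartition(np.abs(x), d - S)[d - S:]
    mask[top_idx] = 1
    return mask

x = np.array([0.5, -3.0, 2.0, 0.1])
mask = top_s_mask(x, S=2)

# Brute-force check under the assumed criterion: no other choice of
# S positions retains more squared magnitude than the top-S mask.
retained = float(np.sum((mask * x) ** 2))
best = max(
    float(np.sum(x[list(idx)] ** 2))
    for idx in itertools.combinations(range(x.size), 2)
)
```

Ties in magnitude are broken arbitrarily by `argpartition`, which does not affect the retained norm.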

Figures (12)

  • Figure 1: Left: sample images from the TSI dataset compared to DALL-E generated images for the same labels. Right: sample MiniGPT-v2 responses given TSI and DALL-E images, indicated by a yellow arrow.
  • Figure 1: MiniGPT-v2 VQA accuracy (%) after finetuning the LLM (with LoRA, $r=64$) compared to finetuning EVA-CLIP-G separately.
  • Figure 2: Use Laptop Example
  • Figure 3: Watching TV Example
  • Figure 4: Use Tablet Example
  • ...and 7 more figures

Theorems & Definitions (6)

  • Definition 3.1
  • Lemma 3.2
  • proof
  • Remark 3.3
  • Corollary 3.4
  • proof