Efficient Few-Shot Continual Learning in Vision-Language Models

Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E. Turner

TL;DR

LoRSU tackles the challenge of continual, few-shot adaptation in vision–language models by selectively updating the image encoder with structured, low-rank updates. It uses gradient-informed selection to fine-tune a small subset of attention heads via LoRA adapters, together with a masked subset of the parameters in the first FC layer of the MLP, leaving the language model untouched while correcting visual feature extraction. Across ten VQA datasets and three CL settings, LoRSU achieves strong Target Improvement with minimal forgetting, and its replay-free design reduces computational overhead by over 25× compared to full VLM updates. The approach demonstrates robust performance, scalability, and practicality for resource-constrained deployment, and sets a new benchmark for few-shot continual learning in VLMs. The work also provides a detailed empirical and methodological framework, including new datasets (TSI and DALLE) and comprehensive ablations on head selection and update schemes.

Abstract

Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. Moreover, real-world applications often require the model to be adapted continually as new, often limited, data arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25× compared to full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting validate LoRSU's scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.

Paper Structure

This paper contains 37 sections, 2 theorems, 8 equations, 7 figures, 25 tables.

Key Result

Lemma 1.2

For any $\mathbf{x} \in \mathbb{R}^d \setminus \{\mathbf{0}\}$ and $1 \leq C \leq d$, the optimal mask is zero everywhere except at the positions of the $C$ largest-magnitude elements of $\mathbf{x}$.
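As an illustration of the lemma, the following is a minimal sketch of building such a mask. It assumes a binary mask with exactly $C$ ones; the optimality criterion itself is not restated on this page, so the code only implements the selection rule the lemma names. The function name `top_c_mask` is hypothetical and not taken from the paper.

```python
def top_c_mask(x, c):
    """Binary mask that keeps the c largest-magnitude entries of x.

    Hypothetical helper: it mirrors the selection rule stated in the lemma,
    not any code released with the paper.
    """
    if not 1 <= c <= len(x):
        raise ValueError("c must satisfy 1 <= c <= len(x)")
    # Indices ordered by descending magnitude; ties keep their original order.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:c])
    return [1 if i in keep else 0 for i in range(len(x))]


x = [0.1, -3.0, 2.0, 0.5]
mask = top_c_mask(x, 2)
print(mask)  # [0, 1, 1, 0]
```

Only the magnitudes matter here, so the negative entry `-3.0` is kept ahead of the smaller positive entries.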

Figures (7)

  • Figure 1: (Left) Responses of the pretrained LLaVA to samples from TSI dataset (bottom) compared to DALL·E 2 generated images (top) for the 'cooking on a stove' class. (Right) LLaVA’s correct response to the same TSI image after fine-tuning LLaVA using LoRSU.
  • Figure 2: LoRSU mechanism: After computing the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}_t (\boldsymbol{\theta})$ over the target dataset at time $t$, LoRSU picks a small number of attention heads and a small number of parameters from the first linear layer of the MLP module in the transformer block, based on the magnitudes of the gradients $\nabla_{W_{\text{Attn}}} \mathcal{L}_t$ and $\nabla_{W_{\text{fc1}}} \mathcal{L}_t$, respectively. Computational efficiency is ensured by introducing LoRA adapters to the attention weight matrices.
  • Figure 3: TFlops and trainable parameters comparison between LoRSU with CLIP loss (LoRSU), perplexity loss (LoRSU-Ppl), and LoRA-F.
  • Figure 4: Instances of the 'Use Laptop' action.
  • Figure 5: Instances of the 'Watching TV' action.
  • ...and 2 more figures
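
The selection step described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: scoring each head by the L2 norm of its gradient block is an assumption (the caption only says selection is based on gradient magnitude), and the names `select_heads` and `select_fc1_mask`, as well as the flattened list-of-lists layout, are hypothetical.

```python
import math


def select_heads(head_grads, k):
    """Return the indices of the k heads with the largest gradient norm.

    head_grads: one flattened gradient block per attention head
    (assumed layout). The L2 norm as the score is an assumption.
    """
    scores = [math.sqrt(sum(g * g for g in grad)) for grad in head_grads]
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def select_fc1_mask(fc1_grad, c):
    """Binary mask over fc1 parameters keeping the c largest |gradient| entries."""
    ranked = sorted(range(len(fc1_grad)), key=lambda i: abs(fc1_grad[i]), reverse=True)
    keep = set(ranked[:c])
    return [1 if i in keep else 0 for i in range(len(fc1_grad))]


# Toy gradients for four heads and five fc1 parameters (made-up numbers).
head_grads = [[0.1, 0.2], [1.0, -2.0], [0.0, 0.05], [0.5, 0.5]]
fc1_grad = [0.01, -0.9, 0.3, 0.0, 0.4]

heads = select_heads(head_grads, 2)
mask = select_fc1_mask(fc1_grad, 2)
print(heads)  # [1, 3]
print(mask)   # [0, 1, 0, 0, 1]
```

In a setup like the one the caption describes, the selected heads would then receive LoRA adapters and the mask would gate updates to the first MLP layer; both steps are omitted here.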

Theorems & Definitions (6)

  • Definition 1.1
  • Lemma 1.2
  • proof
  • Remark 1.3
  • Corollary 1.4
  • proof