Overcoming Generic Knowledge Loss with Selective Parameter Update

Wenxuan Zhang, Paul Janson, Rahaf Aljundi, Mohamed Elhoseiny

TL;DR

This work tackles the problem of updating foundation models without erasing their broad pretraining knowledge. It proposes Selective Parameter Update (SPU), which localizes learning to the first layer of the MLP blocks in the transformer and selects a sparse subset of those parameters with a gradient-based scoring function. Only the selected parameters are updated, while replay buffers mitigate forgetting. Across six continual learning (CL) sequences on a CLIP-based vision-language backbone, SPU improves accuracy on new tasks by up to about 7%, with a drop of only about 0.9% on a pretraining control set, while updating only about 3% of the model parameters. The approach is efficient (no extra data or parameters) and broadly applicable to other architectures, offering a practical path to continually expanding the knowledge base of large pretrained models without sacrificing generic capabilities.
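
The select-then-update idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule used here (absolute gradient magnitude), the helper names `select_top_k_mask` and `masked_sgd_step`, the plain SGD step, and all toy numbers are assumptions made for illustration. The summary only states that the scoring function is gradient-based and that a sparse subset of first-layer MLP weights is updated.

```python
def select_top_k_mask(grad, keep_fraction):
    """Build a 0/1 mask keeping the entries of `grad` with the largest |gradient|.

    Assumption: |gradient| stands in for the gradient-based scoring function.
    """
    rows, cols = len(grad), len(grad[0])
    scored = [(abs(grad[i][j]), i, j) for i in range(rows) for j in range(cols)]
    k = max(1, int(keep_fraction * len(scored)))
    scored.sort(key=lambda item: item[0], reverse=True)
    mask = [[0] * cols for _ in range(rows)]
    for _, i, j in scored[:k]:
        mask[i][j] = 1
    return mask


def masked_sgd_step(weights, grad, mask, lr):
    """Apply one SGD step only where mask == 1; all other weights stay frozen."""
    return [
        [w - lr * g * m for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grad, mask)
    ]


# Toy stand-in for the weight matrix of one "first MLP layer" (hypothetical values).
weights = [[0.5, -0.2, 0.1, 0.0], [0.3, 0.0, -0.4, 0.2]]
# Toy gradient of the new-task loss with respect to those weights.
grad = [[0.01, -0.9, 0.02, 0.0], [0.6, -0.03, 0.05, 0.1]]

mask = select_top_k_mask(grad, keep_fraction=0.25)  # keep 2 of the 8 weights
updated = masked_sgd_step(weights, grad, mask, lr=0.1)
```

In this toy case only the two weights with the largest gradient magnitude change, and the remaining six keep their pretrained values, which is the mechanism the summary credits for limiting loss of generic knowledge.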

Abstract

Foundation models encompass an extensive knowledge base and offer remarkable transferability. However, this knowledge becomes outdated or insufficient over time. The challenge lies in continuously updating foundation models to accommodate novel information while retaining their original capabilities. Leveraging the fact that foundation models have initial knowledge on various tasks and domains, we propose a novel approach that, instead of updating all parameters equally, localizes the updates to a sparse set of parameters relevant to the task being learned. We strike a balance between efficiency and new task performance, while maintaining the transferability and generalizability of foundation models. We extensively evaluate our method on foundational vision-language models with a diverse spectrum of continual learning tasks. Our method achieves improvements on the accuracy of the newly learned tasks up to 7% while preserving the pretraining knowledge with a negligible decrease of 0.9% on a representative control set accuracy.

Paper Structure

This paper contains 26 sections, 10 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: We propose the SPU algorithm. We first localize the update to the first layer of the MLP blocks, and then select a sparse set of parameters specialized to the new task to update.
  • Figure 2: Causal tracing results for the visual and text towers of CLIP. Changing MLP layers has a larger effect on CLIP's predictions than changing attention layers.
  • Figure 3: Regions highlighted by the activations of selected neurons in the first MLP layer of the 9th transformer block, visualized with gScoreCAM. The selected neurons represent meaningful features in the input image.
  • Figure 4: Repeat rate of the selected weights in the visual and text towers at layer 5 and layer 10 of CLIP. Weights selected by two different tasks account for only a small fraction of all selected weights.
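
The repeat rate in Figure 4 can be made concrete with a small sketch. The exact definition is not given in this summary, so the one below is an assumption: the fraction of weight indices selected for one task that are also selected for another. The function name `repeat_rate` and the index sets are hypothetical.

```python
def repeat_rate(selected_a, selected_b):
    """Fraction of the indices in `selected_a` that also appear in `selected_b`.

    Assumed definition; the paper may normalize differently.
    """
    a, b = set(selected_a), set(selected_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical (row, column) indices of weights selected for two tasks.
task1_selected = {(0, 1), (1, 0), (2, 3), (4, 4)}
task2_selected = {(1, 0), (3, 2), (5, 1), (0, 0)}

rate = repeat_rate(task1_selected, task2_selected)  # one shared index out of four
```

A low value under this definition means the two tasks mostly update different weights, which matches the observation in the Figure 4 caption.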