Modular Embedding Recomposition for Incremental Learning

Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

TL;DR

MoDER addresses zero-shot continual learning with Vision-Language Models by building a modular library of textual experts stored in a foundational hub and composing them to form refined prototypes for unseen classes. It introduces Textual Alignment to train class-specific experts and Mixture of Textual Experts (MoTE) to forge new prototypes on the fly, enhanced by $\alpha$-smoothing and template augmentation for robustness. Across Class-IL and MTIL benchmarks (14 datasets), MoDER achieves state-of-the-art CI-Transfer and strong Final Average Accuracy, while using far fewer trainable parameters and enabling single-pass inference. The approach offers a scalable, online-friendly, privacy-conscious framework for modular knowledge reuse in VLMs.

Abstract

The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
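The inference step described above (query the hub, retrieve experts, compose them into a prototype) can be illustrated with a minimal sketch. It is not the authors' implementation: the hub layout (a key embedding paired with an expert embedding per seen class), the top-k retrieval, the softmax weighting, and the names `compose_prototype`, `top_k`, and `temperature` are all assumptions made for illustration, and the toy 3-dimensional vectors stand in for real text embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compose_prototype(query, hub, top_k=2, temperature=0.1):
    """Compose a prototype for an unseen class from stored experts.

    query: text embedding of the unseen class name (assumed given).
    hub:   dict mapping seen class name -> (key embedding, expert embedding).
           This layout is hypothetical, not taken from the paper.
    Returns the unit-norm prototype and the (class, weight) mixture used.
    """
    # Rank seen classes by similarity between the query and their keys.
    scored = sorted(
        ((cosine(query, key), name) for name, (key, _) in hub.items()),
        reverse=True,
    )[:top_k]

    # Softmax over the retained similarities (assumed weighting scheme).
    top = max(score for score, _ in scored)
    exps = [math.exp((score - top) / temperature) for score, _ in scored]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the retrieved expert embeddings.
    mixed = [0.0] * len(query)
    for weight, (_, name) in zip(weights, scored):
        expert = hub[name][1]
        for i, value in enumerate(expert):
            mixed[i] += weight * value

    mixture = [(name, weight) for weight, (_, name) in zip(weights, scored)]
    return normalize(mixed), mixture


# Toy hub with three seen classes (illustrative values only).
hub = {
    "cat": ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "dog": ([0.8, 0.6, 0.0], [0.7, 0.7, 0.0]),
    "car": ([0.0, 0.0, 1.0], [0.0, 0.1, 0.9]),
}
query = [0.9, 0.2, 0.0]  # stand-in embedding for an unseen class
prototype, mixture = compose_prototype(query, hub)
```

In this sketch the resulting `prototype` would replace the plain zero-shot text embedding when scoring images for the unseen class, so classification still needs only one pass through the encoders.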

Paper Structure

This paper contains 13 sections, 5 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: An overview of our approach MoDER. The left side depicts the generative modeling and Textual Alignment (TA) phases. The right side represents the forging of the embedding for the unseen classes.
  • Figure 2: For various benchmarks, the accuracy trend in Class-Incremental transfer indicates the model's effectiveness in transferring to unseen classes in future tasks. A higher trend reflects greater effectiveness in adapting to unseen classes.
  • Figure : MoDER --- Training Phase