MMRL: Multi-Modal Representation Learning for Vision-Language Models

Yuncheng Guo, Xiaodong Gu

TL;DR

MMRL addresses the challenge of adapting large vision-language models with limited data by introducing a shared, learnable representation space whose tokens are projected into text and image representation tokens and inserted into the higher layers of both encoders. The framework trains these representation tokens with a trainable projection layer while keeping the original class-token projection frozen, and it regularizes the class and text features to stay aligned with the zero-shot features of the frozen CLIP model, which preserves generalization. At inference, a decoupling strategy uses both representation and class features for base classes but only class features for novel classes, yielding strong base performance without sacrificing zero-shot generalization. Across the 15 datasets evaluated, including the 11 used for base-to-novel generalization and few-shot learning, MMRL reports state-of-the-art base-to-novel and cross-dataset performance along with robust domain generalization and few-shot transfer, demonstrating the practical value of balancing adaptation and generalization in vision-language models.

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders, where dataset-specific features are more prominent, while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
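
The decoupled inference strategy described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the CLIP-style temperature of 100, and the weighted blend controlled by `alpha` are assumptions made for illustration only, since the abstract states just that base classes use both feature types and new tasks use class features alone.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_probs(image_feat, text_feats, temperature=100.0):
    """Softmax over temperature-scaled cosine similarities, CLIP-style.

    temperature=100.0 is an assumed value, not taken from the text.
    """
    img = l2_normalize(image_feat)
    logits = [
        temperature * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_feats
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mmrl_predict(class_feat, rep_feat, text_feats, is_base_task, alpha=0.5):
    """Hypothetical decoupled inference.

    Base classes: blend class-feature and representation-feature predictions.
    New tasks: use the class-feature prediction only.
    The alpha-weighted blend is an assumption of this sketch.
    """
    p_class = class_probs(class_feat, text_feats)
    if not is_base_task:
        return p_class
    p_rep = class_probs(rep_feat, text_feats)
    return [alpha * pc + (1 - alpha) * pr for pc, pr in zip(p_class, p_rep)]
```

For example, `mmrl_predict(f_c, f_r, text_feats, is_base_task=False)` ignores `f_r` entirely, which mirrors the idea that class features retain more generalized knowledge for unseen classes.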

Paper Structure

This paper contains 24 sections, 9 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Comprehensive comparison of the harmonic mean performance between the previous state-of-the-art (SOTA) method MMA and our proposed MMRL across 11 diverse datasets for base-to-novel generalization. Our method achieves the best result on all datasets.
  • Figure 2: MMRL training framework. Here, 'C' denotes the class token, 'B' the BOT token, 'E' the EOT token, $\mathcal{R}$ our representation space, and 'R' the representation token. Only the representation space $\mathcal{R}$, mapping function $\mathcal{F}$, and the patch projection layer for the representation tokens are optimized, while the entire pre-trained CLIP model remains frozen. To preserve generalization knowledge, we integrate representation tokens in both encoders starting from layer $J$.
  • Figure 3: MMRL inference process, where different tasks utilize distinct features.
  • Figure 4: Comparison of MMRL with previous state-of-the-art methods on few-shot learning across 11 datasets. Detailed results on all 11 datasets are provided in the Supplementary Material.
  • Figure 5: Ablation on layers (left) and $K$ (right).
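
Figure 1 compares methods by harmonic mean, which the caption does not define. In base-to-novel generalization the usual convention is the harmonic mean of base-class and novel-class accuracy, and the sketch below assumes that convention. The function name and the accuracy values in the usage note are hypothetical and are not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy.

    Penalizes imbalance: a method that is strong on base classes but weak
    on novel classes scores lower than its arithmetic mean would suggest.
    """
    if base_acc + novel_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)
```

With made-up accuracies of 80.0 on base classes and 60.0 on novel classes, `harmonic_mean(80.0, 60.0)` is about 68.57, below the arithmetic mean of 70.0, which is why the metric rewards balanced adaptation and generalization.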