Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Haixin Wang; Xinlong Yang; Jianlong Chang; Dian Jin; Jinan Sun; Shikun Zhang; Xiao Luo; Qi Tian

TL;DR

Aurora addresses the problem of adapting large multimodal foundation models to downstream tasks under extreme parameter budgets. It introduces a prompt mechanism based on mode approximation via CP decomposition, which generates a tiny delta added to the frozen backbone using about $0.1$M trainable parameters (roughly $0.04\%$ of the backbone). Two complementary components, Informative Context Enhancement and Gated Query Transformation, improve cross-modal alignment with minimal parameter overhead. Across six cross-modal benchmarks and zero-shot settings, Aurora frequently matches or surpasses full fine-tuning and outperforms many PETL baselines, which suggests that lightweight, well-structured cross-modal prompts are sufficient for efficient multimodal transfer.

Abstract

Driven by the progress of large-scale pre-training, parameter-efficient transfer learning has gained immense popularity across different subfields of Artificial Intelligence. Its core idea is to adapt a model to downstream tasks by training only a small set of parameters. Recently, researchers have applied these proven techniques to multimodal tasks and achieved promising results. However, two critical issues remain unresolved: how to further reduce complexity with a lightweight design, and how to boost alignment between modalities when very few parameters are available. In this paper, we propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first use mode approximation to generate 0.1M trainable parameters that implement multimodal prompt tuning, exploiting the low intrinsic dimension with only 0.04% of the parameters of the pre-trained model. Then, for better modality alignment, we propose the Informative Context Enhancement and Gated Query Transformation modules, designed for scenarios with extremely few parameters. A thorough evaluation on six cross-modal benchmarks shows that Aurora not only outperforms the state of the art but also outperforms full fine-tuning. Our code is available at: https://github.com/WillDreamer/Aurora.
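As a rough illustration of what a CP-style mode approximation computes, the sketch below rebuilds a three-way update tensor from rank-R factors. The factor names (lam, U, V, P), the layout in which the third mode indexes the weight matrices being adapted, and the toy sizes are assumptions made for this example; the summary itself does not specify them, and this is not the authors' implementation.

```python
# Minimal pure-Python sketch of a CP (CANDECOMP/PARAFAC) reconstruction.
# Assumption: the update tensor is indexed as delta[k][i][j], where k selects
# one of the adapted weight matrices and (i, j) index its rows and columns.

def cp_delta(lam, U, V, P):
    """Return delta[k][i][j] = sum_r lam[r] * U[i][r] * V[j][r] * P[k][r]."""
    rank = len(lam)
    return [
        [
            [
                sum(lam[r] * U[i][r] * V[j][r] * P[k][r] for r in range(rank))
                for j in range(len(V))
            ]
            for i in range(len(U))
        ]
        for k in range(len(P))
    ]


# Hypothetical toy factors: rank 2, two rows, two columns, three matrices.
lam = [2.0, -1.0]
U = [[1.0, 0.0], [2.0, 1.0]]               # row factors, shape (2, rank)
V = [[3.0, 1.0], [4.0, 0.0]]               # column factors, shape (2, rank)
P = [[5.0, 1.0], [0.0, 2.0], [1.0, 1.0]]   # per-matrix factors, shape (3, rank)

delta = cp_delta(lam, U, V, P)
```

Only the factors would be trained in such a scheme; each slice delta[k] is then added to the corresponding frozen weight matrix, so the number of trainable values grows with the sum of the mode sizes rather than their product.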

Paper Structure (28 sections, 1 theorem, 25 equations, 13 figures, 5 tables)

Introduction
Related Work
Vision-Language Models
Parameter-efficient Transfer Learning
Tensor Decomposition
Methodology
Background
Lightweight Design for PETL
Modality Alignment Design
Experiments
Experimental Settings
Performance Comparisons on Cross-modal Tasks
Performance Comparisons on Zero-shot Setting
Analysis of Different Designs
Visualization Analysis
...and 13 more sections

Key Result

Theorem F.1

Under the above assumptions, and suppose that we train for $n$ epochs with $\eta \leq 1/M$ using gradient descent. Let $\boldsymbol{\mathcal{W}}^*$ be the optimal parameter tensor, then, Moreover, $\boldsymbol{\mathcal{W}}^*$ is unique.
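The step-size condition in the theorem can be illustrated on a toy problem. The sketch below assumes that $M$ plays the role of a smoothness (gradient Lipschitz) constant, which is a guess because the theorem's assumptions are not reproduced here, and it uses a hypothetical strongly convex quadratic rather than the paper's objective. It does not reconstruct the bound that is missing from the statement above.

```python
# Gradient descent with step size 1/M on a toy quadratic
#   f(w) = sum_i 0.5 * curv[i] * (w[i] - target[i])**2
# Assumption: M is the largest curvature, i.e. the smoothness constant of f.

def gradient_descent(grad, w0, lr, epochs):
    """Run plain gradient descent for a fixed number of epochs."""
    w = list(w0)
    for _ in range(epochs):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


curv = [1.0, 4.0]       # hypothetical per-coordinate curvatures
target = [2.0, -1.0]    # the unique minimizer of this toy objective
M = max(curv)


def loss(w):
    return sum(0.5 * a * (wi - c) ** 2 for a, wi, c in zip(curv, w, target))


def grad(w):
    return [a * (wi - c) for a, wi, c in zip(curv, w, target)]


w_start = [0.0, 0.0]
w_final = gradient_descent(grad, w_start, lr=1.0 / M, epochs=100)
```

With this step size each coordinate's error shrinks by the factor 1 - curv[i] / M per epoch, so the iterates approach the single minimizer without overshooting.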

Figures (13)

Figure 1: Comparison of existing PETL methods for downstream cross-modal tasks. (a) Adapter, which involves inserting a learnable small network into a pre-trained model; (b) LoRA, which employs a down and up tensor as updated parameters for low-rank approximation (R $\ll$ d), added to the pre-trained model; and (c) our proposed Aurora, which utilizes mode approximation to further reduce the number of trainable parameters added to the pre-trained model. Notably, the red blocks represent trainable parameters, while the blue ones indicate the frozen backbone.
Figure 2: Demonstration of the overall framework. The frozen backbone network is shown in grey. The trainable parameters in color represent: blue for vision tasks, pink for text tasks, and the gradient color for fused modalities. Notably, globally shared parameters are represented in purple.
Figure 3: Effect of the rank $R$ on Aurora. (a), (b), and (c) show how performance improves as $R$ grows on three different cross-modal tasks. For readability, the results are plotted on two $y$-axes: Recall@1 on the left axis and Recall@5/10 on the right. (d) compares parameter scalability with other PETL methods.
Figure 4: Analysis of the impact of the informative context enhancement module.
Figure 5: Visualization of cross-attention map comparisons on Flickr30K, which shows the capability to locate the most semantic-related visual parts for specific words in the text.
...and 8 more figures
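The contrast that Figure 1 draws between LoRA's down and up factors and a mode-approximation factorization can be made concrete by counting trainable values. The sketch below is illustrative only: the hidden size, the number of adapted matrices, the rank, and the choice to share one CP factorization across all matrices are hypothetical assumptions, and the counts are not the paper's reported figures.

```python
# Back-of-the-envelope parameter counts for three ways of updating
# n_mats square weight matrices of size d x d. All sizes are hypothetical.

def dense_params(d, n_mats):
    """Full update: every entry of every matrix is trainable."""
    return n_mats * d * d


def lora_params(d, n_mats, rank):
    """LoRA-style update: one d x rank and one rank x d factor per matrix."""
    return n_mats * 2 * d * rank


def cp_params(d, n_mats, rank):
    """CP-style update shared across matrices: two d x rank factors,
    one n_mats x rank factor, and rank scaling coefficients."""
    return rank * (d + d + n_mats + 1)


d, n_mats, rank = 768, 48, 8   # assumed sizes, chosen only for illustration
counts = {
    "dense": dense_params(d, n_mats),
    "lora": lora_params(d, n_mats, rank),
    "cp": cp_params(d, n_mats, rank),
}
for name, value in counts.items():
    print(f"{name:>5}: {value:,}")
```

Under these assumptions the LoRA-style count scales with the product of the number of matrices and the hidden size, while the shared CP count scales with their sum, which is the kind of reduction the figure attributes to mode approximation.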

Theorems & Definitions (4)

Definition 1
Definition 2
Theorem F.1
Proof F.1
