Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Yichi Zhang, Yinpeng Dong, Siyuan Zhang, Tianzan Min, Hang Su, Jun Zhu

TL;DR

This work addresses the cost of adapting diverse Multimodal Large Language Models (MLLMs) one model at a time by proposing Transferable Visual Prompting (TVP), which learns a shared visual prompt on one model and transfers it to others. TVP combines two complementary strategies: Feature Consistency Alignment (FCA), which preserves task-agnostic pre-trained knowledge, and Task Semantics Enrichment (TSE), which injects task-specific semantics under CLIP guidance. Together they improve cross-model performance across 10 diverse tasks and 6 MLLMs. Empirically, TVP outperforms existing visual prompting methods, gains further from model ensembling, remains effective with limited training data, generalizes across datasets, and is robust to common image corruptions, while remaining computationally efficient. The findings support a practical, PaaS-aligned approach in which a plug-in prompt benefits multiple models without per-model fine-tuning, with implications for deployable multimodal systems.

Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting in which we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach for generating visual prompts that transfer to different models and improve their performance on downstream tasks after being trained on only one model. We introduce two strategies to address the cross-model feature corruption of existing visual prompting methods and to enhance the transferability of the learned prompts: (1) Feature Consistency Alignment, which imposes constraints on the prompted feature changes to maintain task-agnostic knowledge; and (2) Task Semantics Enrichment, which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.
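
The two strategies named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the prompt is modeled as an additive border around the image, the Feature Consistency Alignment term as a squared L2 penalty on the feature change, the Task Semantics Enrichment term as a CLIP-style cross-entropy over cosine similarities to class text embeddings, and the names `apply_prompt`, `fca_loss`, `tse_loss`, `tvp_objective`, `lam_fca`, and `lam_tse` are hypothetical.

```python
import numpy as np


def border_mask(height, width, pad):
    """Return a (H, W, 1) mask that is 1.0 on a frame of width `pad`, else 0.0."""
    mask = np.ones((height, width, 1), dtype=np.float64)
    mask[pad:height - pad, pad:width - pad, :] = 0.0
    return mask


def apply_prompt(image, prompt, pad):
    """Add the shared prompt on the border only and keep pixels in [0, 1].

    Assumption: an additive border prompt; the paper's exact prompt
    parameterization is not given in this summary.
    """
    height, width, _ = image.shape
    return np.clip(image + border_mask(height, width, pad) * prompt, 0.0, 1.0)


def fca_loss(feat_prompted, feat_clean):
    """Assumed FCA term: mean squared L2 distance between prompted and clean features."""
    diff = feat_prompted - feat_clean
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def tse_loss(image_emb, text_embs, labels, temperature=0.07):
    """Assumed TSE term: cross-entropy over cosine similarities to class text embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


def tvp_objective(task_loss, feat_prompted, feat_clean,
                  image_emb, text_embs, labels, lam_fca=1.0, lam_tse=1.0):
    """Weighted sum of the task loss and the two regularizers (weights are hypothetical)."""
    return (task_loss
            + lam_fca * fca_loss(feat_prompted, feat_clean)
            + lam_tse * tse_loss(image_emb, text_embs, labels))
```

In a real training loop the features and embeddings would come from the source MLLM's visual encoder and a CLIP model, and only the prompt pixels would receive gradients; the sketch covers only how the three terms could be combined.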

Paper Structure

This paper contains 34 sections, 7 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: (a) Illustration of the problem setting: we aim to improve the performance of different MLLMs on a specific task with a set of shared parameters. This is achieved by exploiting the transferability of visual prompts trained on one model and applying them to other models. (b) Demonstration of the effect: partial results on SVHN [svhn], with the visual prompt trained on MiniGPT-4 [zhu2023minigpt] and tested on InstructBLIP [Dai2023InstructBLIP], BLIP2 [li2023blip], and BLIVA [hu2023bliva]. Compared with existing visual prompting methods [bahng2022exploring, wu2022unleashing], the proposed Transferable Visual Prompting (TVP) improves different models by larger margins. Detailed results are in the main results section. ZS denotes zero-shot inference without a prompt.
  • Figure 2: Overview of our proposed Transferable Visual Prompting (TVP) method for adapting MLLMs. TVP optimizes a visual prompt on a single MLLM for a downstream task. Feature Consistency Alignment (FCA) and Task Semantics Enrichment (TSE) are proposed to make the learned visual prompts more transferable, so that they also improve unseen MLLMs on the same task.
  • Figure 3: t-SNE visualization of visual features from InstructBLIP and BLIVA on CIFAR-10 with and without the visual prompt, which is trained on MiniGPT-4 using VP [bahng2022exploring]. When the images are prompted, the visual features of different categories get mixed together, leading to performance degradation.
  • Figure 4: GradCAM [gradcam] of VPGTrans [zhang2023transfer] on 3 different tasks. TVP encourages the model to attend to task-related objects.
  • Figure 5: Curves of average performance as the training data scale changes. TVP can effectively enhance the performance of different models even with only 1% of the data, and its overall performance improves as the data size increases.