MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

Tao Chen, Enwei Zhang, Yuting Gao, Ke Li, Xing Sun, Yan Zhang, Hui Li, Rongrong Ji

TL;DR

MMICT introduces a fine-tuning paradigm for multi-modal LLMs (MM-LLMs) built on a unified Multi-Modal Hub (M-Hub). The model learns from in-context demonstrations encoded as visual-guided textual features and generates outputs conditioned on textual-guided visual features, addressing the gap between the gains from in-context learning and those from traditional fine-tuning. Experiments on image/video captioning and VQA/VideoQA show consistent improvements over the vanilla fine-tuning (VanillaFT) and vanilla ICT (VanillaICT) baselines, with state-of-the-art performance on MSVD video captioning and VideoQA among MM-LLMs. The results highlight the effectiveness of structured cross-modal fusion and robust demonstration design, and suggest that in-context tuning could extend MM-LLMs to additional modalities and tasks.

Abstract

Although In-Context Learning (ICL) brings remarkable performance gains to Large Language Models (LLMs), the improvements remain lower than those from fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms the traditional fine-tuning strategy and the vanilla ICT method, which directly takes the concatenation of all information from different modalities as input. Our implementation is available at: https://github.com/KDEGroup/MMICT.

Paper Structure

This paper contains 26 sections, 6 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Overview of MMICT. M-Hub can output both visual-guided textual features (upper left green part) and instruction-guided visual features (upper right orange part). MMICT learns from visual-guided textual features derived from demonstration examples and generates outputs based on instruction-guided visual features obtained from input queries.
  • Figure 2: Different usages of M-Hub. As demonstrated in (a) and (b), it can function as a uni-modal encoder. Moreover, it can also operate as a multi-modal fusion encoder, as shown in (c), (d), and (e).
  • Figure 3: Case study on (a) visual question answering, (b) image captioning, (c) video question answering, and (d) video captioning. We show the answers generated by VanillaFT, VanillaICT-$\text{B}_{\text{VT}}$ and MMICT in orange, green and blue, respectively.
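
The input flow described in the Figure 1 caption, and the contrast with vanilla ICT drawn in the abstract, can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_hub`, `Feature`, and both builder functions are hypothetical names, strings stand in for real visual and text features, and the choice to build each demonstration's text from its instruction plus answer is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (visual, instruction, answer) triple; plain strings stand in for real data.
Demo = Tuple[str, str, str]


@dataclass(frozen=True)
class Feature:
    kind: str    # which feature type the stand-in hub was asked to produce
    source: str  # readable record of the inputs the feature was built from


def toy_hub(visual: Optional[str], text: Optional[str], want: str) -> Feature:
    """Hypothetical stand-in for M-Hub: it only tags the requested feature type."""
    if visual is None or text is None:
        raise ValueError("this toy hub only models the two fused feature types")
    if want == "visual_guided_text":
        return Feature(want, f"text[{text}] guided by visual[{visual}]")
    if want == "text_guided_visual":
        return Feature(want, f"visual[{visual}] guided by text[{text}]")
    raise ValueError(f"unknown feature type: {want}")


def build_mmict_context(demos: List[Demo], query_visual: str,
                        query_instruction: str) -> List[Feature]:
    """Demonstrations become visual-guided textual features; the query
    becomes an instruction-guided visual feature placed last."""
    # Assumption: a demonstration's text is its instruction plus its answer.
    context = [toy_hub(v, f"{ins} {ans}", "visual_guided_text")
               for v, ins, ans in demos]
    context.append(toy_hub(query_visual, query_instruction, "text_guided_visual"))
    return context


def build_vanilla_ict_context(demos: List[Demo], query_visual: str,
                              query_instruction: str) -> List[str]:
    """Baseline as described in the abstract: concatenate everything as-is."""
    flat: List[str] = []
    for v, ins, ans in demos:
        flat += [v, ins, ans]
    flat += [query_visual, query_instruction]
    return flat


if __name__ == "__main__":
    demos = [("img1", "What is shown?", "a cat"),
             ("img2", "What is shown?", "a dog")]
    for feature in build_mmict_context(demos, "img3", "What is shown?"):
        print(feature.kind, "<-", feature.source)
```

In this sketch the MMICT-style context holds one fused feature per demonstration plus one for the query, whereas the vanilla ICT context is a flat sequence of every modality's raw input.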