Multimodal Infusion Tuning for Large Models

Hao Sun, Yu Song, Xinyao Yu, Jiaqing Liu, Yen-Wei Chen, Lanfen Lin

TL;DR

This work tackles multimodal integration in large language models with a parameter-efficient approach that keeps foundation models frozen. It introduces Multimodal Infusion Tuning (MiT), which decouples self-attention and progressively infuses multimodal signals into K and V, alongside an adaptive head-level rescaling to stabilize cross-modal interactions, all while tuning only about 2.5% of parameters. Across seven datasets covering referring segmentation, image-text classification, and sentiment analysis, MiT achieves state-of-the-art results with low compute (approximately 0.47 TFLOPs) and demonstrates robust multimodal reasoning and in-context understanding. The method is versatile, supporting image, acoustic, and facial cues, and shows practical impact through improved efficiency and capability in complex multimodal scenarios.

Abstract

Recent advancements in large-scale models have showcased remarkable generalization capabilities in various tasks. However, integrating multimodal processing into these models presents a significant challenge, as it often comes with a high computational burden. To address this challenge, we introduce a new parameter-efficient multimodal tuning strategy for large models, referred to as Multimodal Infusion Tuning (MiT). MiT leverages decoupled self-attention mechanisms within large language models to effectively integrate information from diverse modalities such as images and acoustics. In MiT, we also design a novel adaptive rescaling strategy at the attention head level, which optimizes the representation of infused multimodal features. Notably, all foundation models are kept frozen during the tuning process to reduce the computational burden, and only 2.5% of the parameters are tunable. We conduct experiments across a range of multimodal tasks, including image-related tasks like referring segmentation and non-image tasks such as sentiment analysis. Our results show that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead (10% of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.
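To make the idea of infusing multimodal signals into K and V with head-level rescaling more concrete, the following is a minimal NumPy sketch and not the authors' implementation. Everything named here is an assumption made for illustration: the function `infused_attention`, the projections `Uk` and `Uv`, the per-head vector `head_scale`, its zero initialization, and the premise that modality features are already aligned one-to-one with the text tokens. The paper's actual equations are not reproduced in this summary.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, n_heads):
    # (T, d) -> (H, T, d // H)
    T, d = x.shape
    return x.reshape(T, n_heads, d // n_heads).transpose(1, 0, 2)

def infused_attention(text, modal, frozen, tuned, n_heads):
    """Hypothetical sketch of K/V infusion with per-head rescaling.

    text:   (T, d) text token states
    modal:  (T, d_m) modality features, assumed aligned to the text tokens
    frozen: dict with Wq, Wk, Wv of shape (d, d); these would stay fixed
    tuned:  dict with Uk, Uv of shape (d_m, d) and head_scale of shape (H,)
    """
    q = split_heads(text @ frozen["Wq"], n_heads)
    k = split_heads(text @ frozen["Wk"], n_heads)
    v = split_heads(text @ frozen["Wv"], n_heads)

    # Linear infusion: project the modality features and add them to K and V,
    # weighted by one learnable scalar per attention head (an assumption).
    s = tuned["head_scale"][:, None, None]
    k = k + s * split_heads(modal @ tuned["Uk"], n_heads)
    v = v + s * split_heads(modal @ tuned["Uv"], n_heads)

    dh = q.shape[-1]
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = weights @ v                                   # (H, T, dh)
    return out.transpose(1, 0, 2).reshape(text.shape[0], n_heads * dh)

rng = np.random.default_rng(0)
T, d, d_m, H = 5, 16, 8, 4                              # toy sizes, illustrative only
text = rng.normal(size=(T, d))
modal = rng.normal(size=(T, d_m))
frozen = {n: rng.normal(size=(d, d)) / np.sqrt(d) for n in ("Wq", "Wk", "Wv")}
tuned = {
    "Uk": rng.normal(size=(d_m, d)),
    "Uv": rng.normal(size=(d_m, d)),
    "head_scale": np.zeros(H),                          # zero init: no infusion yet
}

baseline = infused_attention(text, modal, frozen, tuned, H)
tuned["head_scale"] = np.full(H, 0.5)
infused = infused_attention(text, modal, frozen, tuned, H)

# Count parameters in the toy setup (not the paper's 2.5% figure).
tunable = sum(p.size for p in tuned.values())
total = tunable + sum(p.size for p in frozen.values())
```

With `head_scale` at zero the sketch reduces to ordinary self-attention over the frozen projections, so the added parameters start out as a no-op. No extra tokens are appended to the sequence, which matches the summary's description of infusion as opposed to prefixing visual tokens.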

Paper Structure (17 sections, 10 equations, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Comparison between previous multimodal tuning methods (left) [ren2023pixellm] and our proposed multimodal infusion tuning (right), with referring segmentation as the example task. Previous methods usually treat visual embeddings as tokens that are prefixed to the textual tokens, and employ PEFT adapters to tune the LLM on downstream tasks. Our multimodal infusion tuning instead infuses the visual information into each textual token in a linear manner, which is more fine-grained and does not introduce extra tokens to the pretrained LLM.
  • Figure 2: The pipeline of our proposed method (left) and the detailed structure of MiT (right). Referring segmentation is shown as the example; it is one task in a broader framework that also includes image-text classification and sentiment analysis. In MiT, we infuse the visual information into both the self-attention and feedforward modules. The procedure is kept linear to limit the computational burden.
  • Figure 3: The format of the datasets and tasks we use: referring segmentation (left), image-text classification (middle), and sentiment analysis (right). We design a different template for each dataset and task, so as to draw on the capabilities the LLM acquired in pretraining.
  • Figure 4: The impact of text length on the last-token and <SEG>-token schemas for referring segmentation. Experiments are conducted on the testA set of RefCOCO.
  • Figure 5: Case visualizations of complex reasoning (top) and multimodal in-context understanding (bottom). For complex reasoning, we describe objects implicitly instead of naming them directly. For multimodal in-context understanding, we run the experiments as a conversation, which also yields good segmentation results.