Multimodal Infusion Tuning for Large Models
Hao Sun, Yu Song, Xinyao Yu, Jiaqing Liu, Yen-Wei Chen, Lanfen Lin
TL;DR
This work tackles multimodal integration in large language models with a parameter-efficient approach that keeps foundation models frozen. It introduces Multimodal Infusion Tuning (MiT), which decouples self-attention and progressively infuses multimodal signals into K and V, alongside an adaptive head-level rescaling to stabilize cross-modal interactions, all while tuning only about 2.5% of parameters. Across seven datasets covering referring segmentation, image-text classification, and sentiment analysis, MiT achieves state-of-the-art results with low compute (approximately 0.47 TFLOPs) and demonstrates robust multimodal reasoning and in-context understanding. The method is versatile, supporting image, acoustic, and facial cues, and shows practical impact through improved efficiency and capability in complex multimodal scenarios.
Abstract
Recent advancements in large-scale models have showcased remarkable generalization capabilities across various tasks. However, integrating multimodal processing into these models presents a significant challenge, as it often comes with a high computational burden. To address this challenge, we introduce a new parameter-efficient multimodal tuning strategy for large models, referred to as Multimodal Infusion Tuning (MiT). MiT leverages decoupled self-attention mechanisms within large language models to effectively integrate information from diverse modalities such as images and acoustics. In MiT, we also design a novel adaptive rescaling strategy at the attention head level, which optimizes the representation of infused multimodal features. Notably, all foundation models are kept frozen during the tuning process to reduce the computational burden, and only 2.5% of the parameters are tunable. We conduct experiments across a range of multimodal tasks, including image-related tasks such as referring segmentation and non-image tasks such as sentiment analysis. Our results show that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead (10% of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.
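To make the idea of decoupled attention with head-level rescaling more concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the formulation used in the paper: the abstract does not give the equations, so the additive combination of a text branch and a multimodal branch, the `tanh` gate, and all function names (`attend`, `infused_head`, `infused_attention`) are hypothetical. The sketch assumes the multimodal keys and values have already been projected into the head dimension by some small trainable adapter, while the text keys and values come from the frozen language model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def infused_head(q, text_k, text_v, mm_k, mm_v, gate):
    """One attention head with a separate (decoupled) multimodal branch.

    Assumption: the two branches are computed independently and summed,
    with the multimodal branch scaled by tanh(gate), a per-head scalar.
    A gate of 0.0 reproduces the frozen text-only head exactly.
    """
    text_out = attend(q, text_k, text_v)   # frozen LLM keys/values
    mm_out = attend(q, mm_k, mm_v)         # infused multimodal keys/values
    g = math.tanh(gate)                    # hypothetical head-level rescaling
    return [t + g * m for t, m in zip(text_out, mm_out)]

def infused_attention(qs, text_ks, text_vs, mm_ks, mm_vs, gates):
    """Run every head with its own gate and concatenate the head outputs."""
    out = []
    for q, tk, tv, mk, mv, g in zip(qs, text_ks, text_vs, mm_ks, mm_vs, gates):
        out.extend(infused_head(q, tk, tv, mk, mv, g))
    return out
```

In this sketch only the per-head gates (and the adapter that would produce `mm_k` and `mm_v`) are trainable, which is one way a method could keep the tunable fraction of parameters small while the foundation models stay frozen.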
