Explore How to Inject Beneficial Noise in MLLMs

Ruishu Zhu, Sida Huang, Ziheng Jiao, Hongyuan Zhang

TL;DR

This work reformulates the reasoning process of MLLMs from a variational inference perspective and, on that basis, designs a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise.

Abstract

Multimodal Large Language Models (MLLMs) play an increasingly important role in multimodal intelligence. However, existing fine-tuning methods often ignore cross-modal heterogeneity, which limits their full potential. In this work, we propose a novel fine-tuning strategy that injects beneficial random noise; it outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches while requiring only about $1\sim2\%$ additional parameters. The relevant code is provided in the supplementary material.

Paper Structure

This paper contains 24 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Pipeline of MuNG. (a) The overall framework of noise injection in MLLMs. The proposed MuNG is inserted between the feature alignment layer and the LLM decoder, injecting task-adaptive beneficial noise into the visual representations. (b) The architecture of the multimodal noise generator based on cross-attention. A random signal $\epsilon$ is sampled from a standard normal distribution and combined with the mean and variance obtained via cross-attention to generate the final noise. In the figure, $\odot$ denotes the Hadamard product, $\oplus$ denotes matrix or vector addition, and $\otimes$ denotes matrix multiplication.
  • Figure 2: Visualization of the generated noise injected into high-dimensional visual features. The top three rows show the input text, images, and noise module's attention maps; the bottom two show visual-text importance maps before and after noise injection. The attention maps indicate that MuNG can effectively identify and selectively suppress semantically irrelevant or unmentioned regions in the image. The relative importance maps further highlight that the noise enhances the representation of image regions that are more crucial for answering the question.
  • Figure 3: Visualization of generated noise injected into high-dimensional visual features.
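
The noise generator described in the Figure 1(b) caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that visual tokens act as queries and text tokens as keys and values, that the variance is parameterized as a log-variance, and that the resulting noise is added to the visual features. The function names, the projection matrices (`w_q`, `w_k`, `w_v`, `w_mu`, `w_logvar`), and the initialization are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d, rng, scale=0.1):
    # Hypothetical projection matrices; in a real setup these would be
    # the only trainable weights while the MLLM itself stays frozen.
    names = ["w_q", "w_k", "w_v", "w_mu", "w_logvar"]
    return {n: scale * rng.standard_normal((d, d)) for n in names}

def generate_noise(visual, text, params, rng):
    """visual: (n_v, d) visual tokens, text: (n_t, d) text tokens."""
    d = visual.shape[-1]
    q = visual @ params["w_q"]              # queries from visual tokens (assumption)
    k = text @ params["w_k"]                # keys from text tokens
    v = text @ params["w_v"]                # values from text tokens
    attn = softmax(q @ k.T / np.sqrt(d))    # (n_v, n_t) cross-attention weights
    context = attn @ v                      # (n_v, d)
    mu = context @ params["w_mu"]           # mean of the noise
    log_var = context @ params["w_logvar"]  # log-variance (assumed parameterization)
    eps = rng.standard_normal(visual.shape) # random signal eps ~ N(0, I)
    noise = mu + np.exp(0.5 * log_var) * eps  # Hadamard product, then addition
    return noise, mu, log_var

def inject_noise(visual, text, params, rng):
    # Add the generated noise to the visual representations
    # before they would be passed to the LLM decoder.
    noise, _, _ = generate_noise(visual, text, params, rng)
    return visual + noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_v, n_t = 8, 5, 3
    params = init_params(d, rng)
    visual = rng.standard_normal((n_v, d))
    text = rng.standard_normal((n_t, d))
    print(inject_noise(visual, text, params, rng).shape)
```

Sampling `eps` separately and scaling it by the predicted standard deviation is the standard reparameterization trick, which keeps the sampling step differentiable with respect to the mean and variance.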