Visual Instruction Bottleneck Tuning

Changdae Oh, Jiatong Li, Shawn Im, Sharon Li

TL;DR

This work tackles the robustness of multimodal large language models (MLLMs) under distribution shifts by adopting a representation-centric approach grounded in the information bottleneck (IB) principle. It derives a tractable variational lower bound for IB in autoregressive multimodal models and instantiates it as Visual Instruction Bottleneck Tuning (Vittle), which inserts a bottleneck layer inside the LLM to learn minimal sufficient representations via posterior-prior KL regularization. A theoretical justification ties Vittle to an information-theoretic robustness metric (EMID). Empirical results across 45 datasets, including 30 shift scenarios, demonstrate improved robustness to perturbations and long-tail distributions without sacrificing performance on standard benchmarks. The method is model-agnostic, scales to larger backbones, and incurs modest training overhead, offering a practical, principled route to robust, instruction-tuned MLLMs. Limitations include dependence on high-quality target outputs and potential impacts on fine-grained recognition tasks, which motivate future work on noisy annotations and domain-generalization settings.

Abstract

Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger, more advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhance the generalization and robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification of Vittle by revealing its connection to an information-theoretic robustness metric for MLLMs. Empirical validation of multiple MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM's robustness under shifts by pursuing the learning of a minimal sufficient representation.
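
For orientation, the IB objective and its variational lower bound can be written in the standard (non-autoregressive) form below. This is the textbook variational IB bound, not the paper's exact derivation for autoregressive MLLMs, which may factorize the terms differently. The symbols $\beta$ (trade-off weight), $q(y|z)$ (variational decoder), and $r(z)$ (prior over representations) are notation assumed here for illustration.

$$
\max_{p(z|x)} \; I(Z;Y) - \beta\, I(Z;X)
$$

$$
I(Z;Y) - \beta\, I(Z;X) \;\ge\; \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p(z|x)}\big[\log q(y|z)\big] \;-\; \beta\, \mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p(z|x)\,\|\,r(z)\big)\Big] \;+\; H(Y)
$$

Since $H(Y)$ does not depend on the model, maximizing the bound amounts to a likelihood term on the target response plus a posterior-prior KL penalty, which matches the "posterior-prior KL regularization" described in the TL;DR.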

Paper Structure

This paper contains 53 sections, 5 theorems, 28 equations, 14 figures, 16 tables.

Key Result

Proposition 3.2

Let $P_{\Theta}$ be an MLLM that maps $X=\{X_{v},X_{t}\}$ to $Z=\{Z_v,Z_t\}$, and then subsequently maps $Z$ to $Y_{\Theta}$. Given joint distributions $P_{XY}=P_{X}\times P_{Y|X}$ and $Q_{XY}=Q_{X}\times Q_{Y|X}$ (resp. $P_{ZY}$ and $Q_{ZY}$), by assuming consistent conditionals over $Z_v|Z_t$, $Z_

Figures (14)

  • Figure 1: Illustration of distribution shifts for an MLLM (a) and performance degeneration and embedding shifts of the MLLM (b). An MLLM (LLaVA-v1.5-7B) receives arbitrary queries that might be visually and/or textually perturbed by unexpected noise. These distribution shifts result in performance drops, as shown in the middle bar plot. A visualization of intermediate layer representations of the MLLM on LLaVA-Bench-COCO and its variants indicates that the MLLM fails to learn a proper level of invariance to generalize multimodal queries in the representation space.
  • Figure 2: Vittle architecture. We insert a learnable bottleneck layer $g_{\phi}=\{g_{\phi_{v}},g_{\phi_{t}}\}$ on top of $l$ blocks of LLM backbone (i.e., LLM-stem $f_{\theta^{l}}$) to estimate posterior distributions of token embeddings. After obtaining a sample per token $\{\tilde{z}_{v},\tilde{z}_{t}\}$ from posteriors, we interpolate it with a pre-bottlenecked token representation $\{z_{v},z_{t}\}$ and pass it through the remaining LLM blocks (i.e., LLM-head $f_{\theta^{l+}}$).
  • Figure 3: Object hallucination detection performance on POPE variants. We enumerate the hallucination detection accuracy of each method on nine versions of perturbed samples, and observe consistent gains by Vittle (highlighted by green numbers of relative improvement from baseline).
  • Figure 4: Open-ended QA performance on LB-COCO variants. We enumerate the relative preference score of responses from each model on 18 versions of perturbed samples, and observe consistent gains by Vittle (especially Vittle (F)) on most of the textual (top) and joint (bottom) perturbations (results on visual perturbations are deferred to the Appendix).
  • Figure 5: Evaluation under varying perturbation severity. Vittle achieves better performance, especially on severe perturbations.
  • ...and 9 more figures
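
The bottleneck step described in the Figure 2 caption (estimate a per-token posterior, sample from it, interpolate with the pre-bottleneck representation) can be sketched for a single token vector as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the posterior is assumed to be a diagonal Gaussian, the prior a standard normal, and `alpha`, `beta`, and all function names are hypothetical. In the actual method the posterior parameters would come from the learnable layer $g_{\phi}$; here they are simply passed in.

```python
import math
import random


def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.

    Assumption: diagonal-Gaussian posterior and standard-normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def bottleneck_token(z, mu, logvar, alpha, rng):
    """Sample z_tilde from the posterior and interpolate it with z.

    z      : pre-bottleneck token representation (list of floats)
    mu     : posterior mean (would be produced by g_phi; passed in here)
    logvar : posterior log-variance (same caveat)
    alpha  : hypothetical interpolation weight in [0, 1]; 0 keeps z unchanged
    rng    : random.Random instance used for the reparameterized sample
    """
    # Reparameterization: z_tilde = mu + sigma * eps, with eps ~ N(0, 1)
    z_tilde = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    return [(1.0 - alpha) * zi + alpha * zt for zi, zt in zip(z, z_tilde)]


def regularized_loss(nll, kl_visual, kl_text, beta):
    """Task loss plus a beta-weighted posterior-prior KL penalty (hypothetical form)."""
    return nll + beta * (kl_visual + kl_text)


if __name__ == "__main__":
    rng = random.Random(0)
    z = [0.3, -1.2, 0.8]
    mu = [0.1, -1.0, 0.5]
    logvar = [-2.0, -2.0, -2.0]
    mixed = bottleneck_token(z, mu, logvar, alpha=0.5, rng=rng)
    kl = gaussian_kl_to_standard_normal(mu, logvar)
    print(len(mixed), kl > 0.0)
```

The interpolated vector would then be passed to the remaining LLM blocks, and the KL terms for visual and textual tokens added to the training loss.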

Theorems & Definitions (11)

  • Definition 3.1: EMID
  • Proposition 3.2: EMID upper bound
  • Definition E.1: Mutual Information (MI)
  • Definition E.2: Effective Mutual Information (EMI) (Oh et al., 2025)
  • Definition E.3: EMID
  • Lemma E.4: Lemma 1 from Shui et al. (2022)
  • Lemma E.5: Conditional entropy bound (Oh et al., 2025)
  • Lemma E.6
  • proof
  • Proposition E.7: EMID upper bound
  • ...and 1 more