Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou, Feng Hong, Jiaan Luo, Jiangchao Yao, Dongsheng Li, Bo Han, Ya Zhang, Yanfeng Wang

TL;DR

This work tackles overfitting and shortcut learning in visual instruction tuning for multimodal LLMs by introducing Learning to InstrucT (L2T), which jointly learns to generate image-conditioned instructions and responses. The approach adds a template-removal mechanism to focus learning on meaningful visual content and applies the Learn-to-Instruct objective during finetuning, keeping pretraining data unchanged. Empirically, L2T achieves up to $8.5\%$ overall improvement across 16 benchmarks, with pronounced gains in OCR and image captioning, and substantial reductions in hallucinations across multiple evaluation suites. The method is orthogonal to existing MLLM improvements, incurs negligible computational overhead, and broadly enhances visual grounding and data efficiency, offering a scalable path to safer, more reliable multimodal models.

Abstract

We propose L2T, an advancement of visual instruction tuning (VIT). While VIT equips Multimodal LLMs (MLLMs) with promising multimodal capabilities, its current design choices often result in overfitting and shortcut learning, potentially degrading performance. This gap arises from an overemphasis on instruction-following abilities at the expense of proactive understanding of visual information. Motivated by this observation, L2T adopts a simple yet effective approach: it applies the loss function to both the instruction and response sequences. This seamlessly expands the training data and regularizes MLLMs against over-reliance on language priors. As a result, L2T achieves a relative improvement of up to 9% on comprehensive multimodal benchmarks while requiring no additional training data and incurring negligible computational overhead. Surprisingly, L2T also attains exceptional fundamental visual capabilities, yielding up to an 18% improvement in captioning performance while simultaneously alleviating hallucination in MLLMs. GitHub code: https://github.com/Feng-Hong/L2T.
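
The difference between the two objectives can be illustrated with a label mask. The following is a minimal sketch, not the authors' implementation: the role tags (`image`, `instruction`, `template`, `response`), the function `build_labels`, and the toy token ids are all hypothetical names chosen for illustration. It assumes the common convention that a label of -100 is skipped by the cross-entropy loss. Under VIT only response tokens are supervised; under the L2T-style objective described above, instruction tokens are supervised as well, while template tokens stay masked.

```python
from typing import List

# Label value conventionally skipped by cross-entropy losses (assumption:
# the training loop ignores positions carrying this value).
IGNORE_INDEX = -100


def build_labels(token_ids: List[int], roles: List[str],
                 learn_to_instruct: bool) -> List[int]:
    """Return per-token labels; masked positions get IGNORE_INDEX.

    roles[i] is a hypothetical tag for token i: "image", "instruction",
    "template", or "response".
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")

    # VIT-style tuning supervises only the response tokens.
    supervised = {"response"}
    if learn_to_instruct:
        # L2T-style tuning additionally supervises instruction tokens;
        # template tokens are never added, so they remain masked.
        supervised.add("instruction")

    return [tok if role in supervised else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]


# Toy sequence: 2 image tokens, 3 instruction tokens, 2 template tokens,
# 2 response tokens. The ids are arbitrary placeholders.
token_ids = [1, 2, 10, 11, 12, 20, 21, 30, 31]
roles = ["image", "image",
         "instruction", "instruction", "instruction",
         "template", "template",
         "response", "response"]

vit_labels = build_labels(token_ids, roles, learn_to_instruct=False)
l2t_labels = build_labels(token_ids, roles, learn_to_instruct=True)

print("VIT:", vit_labels)
print("L2T:", l2t_labels)
```

Because the two variants differ only in which positions carry a label, the forward pass over the sequence is the same in both cases, which is consistent with the claim of negligible computational overhead.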

Paper Structure

This paper contains 20 sections, 4 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Performance comparison on a broad range of 16 tasks between L2T and VIT using different models, including TinyLLaVA Qwen2-0.5B [DBLP:journals/corr/abs-2402-14289], LLaVA-1.5 Vicuna-7B [DBLP:conf/cvpr/LiuLLL24], and LLaVA-1.5 Vicuna-13B [DBLP:conf/cvpr/LiuLLL24]. The pretraining phase uses the LLaVA-pretrain-558k dataset, while the fine-tuning phase employs the LLaVA-mix-665k dataset.
  • Figure 2: An example where a pure language model provides correct answers based only on language priors, without relying on visual content. This shows that learning to generate responses alone cannot prevent the model from taking shortcuts by ignoring visual content and relying solely on textual instructions.
  • Figure 3: The model architecture using LLaVA as an example, and the data flow for generating responses from images and instructions.
  • Figure 4: Illustration of L2T. In addition to learning to generate responses like VIT, L2T also learns to generate instructions that exclude templates.
  • Figure 5: The visual contribution ($\mathrm{VC}$) distributions of VIT and L2T on the VQAv2 training data and DocVQA test data. Experiments are based on TinyLLaVA Qwen2-0.5B.
  • ...and 7 more figures