Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

TL;DR

This work tackles the data and cost bottlenecks of visual instruction tuning for large vision-language models (LVLMs) by proposing ViFT, a visual instruction-free fine-tuning framework. ViFT splits multimodal task solving into two abilities that are learned separately (visual perception from image captions, and task solving from text-only instructions) and then fuses them at inference via steering vectors. Across MathVista, MathVerse, and MathVision, ViFT and its variant ViFT-A, which adds a small amount of simple VQA data, achieve state-of-the-art results with substantially less training data. The approach offers a scalable, low-cost path to enhancing LVLMs while preserving the core language capabilities inherited from the backbone LLM.

Abstract

Visual instruction tuning has become the predominant technique for eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite its success, visual instructions require images as input, which leaves a gap in inheriting the task-solving capabilities of the backbone LLM and makes large-scale datasets costly to collect. To address this, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. ViFT requires only text-only instructions and image caption data during training, which are used to separately learn the task-solving and visual perception abilities. During inference, we extract and combine the representations of the text and image inputs, fusing the two abilities to solve multimodal tasks. Experimental results show that ViFT achieves state-of-the-art performance on several visual reasoning and visual instruction-following benchmarks with considerably less training data. Our code and data will be publicly released.
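The inference-time fusion described above can be illustrated with a minimal sketch. The abstract states only that representations of the text and image inputs are extracted and combined; the weighted-sum form, the weights `alpha` and `beta`, and the function names below are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_ability_vectors(
    task_vec: Sequence[float],
    vision_vec: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight on the task-solving vector
    beta: float = 1.0,   # hypothetical weight on the visual perception vector
) -> List[float]:
    """Combine two ability vectors into one fused vector.

    Assumption: fusion is a simple weighted sum. The source only says
    the two representations are "combined".
    """
    if len(task_vec) != len(vision_vec):
        raise ValueError("ability vectors must have the same dimension")
    return [alpha * t + beta * v for t, v in zip(task_vec, vision_vec)]


def steer_hidden_state(
    hidden: Sequence[float], steering: Sequence[float]
) -> List[float]:
    """Add a steering vector to a hidden state (element-wise)."""
    if len(hidden) != len(steering):
        raise ValueError("hidden state and steering vector must match")
    return [h + s for h, s in zip(hidden, steering)]


# Toy 2-dimensional example; real hidden states have thousands of dimensions.
fused = fuse_ability_vectors([1.0, 2.0], [0.5, -1.0], alpha=1.0, beta=2.0)
steered = steer_hidden_state([0.0, 1.0], fused)
```

In this toy setup the text-only input would supply `task_vec` and the image input would supply `vision_vec`; how the paper actually extracts and weights these vectors should be checked against its method section.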

Paper Structure

This paper contains 41 sections, 2 equations, 4 figures, 20 tables.

Figures (4)

  • Figure 1: A comparison of ViFT with other instruction-tuned LVLMs in terms of training data size and average benchmark performance on MathVista, MathVision, and MathVerse. ViFT is fine-tuned without any visual instruction data. For ViFT-A, we add 7% additional simple VQA data.
  • Figure 2: In contrast to visual instruction tuning, ViFT first learns disentangled individual abilities through ability-specific fine-tuning. During inference, given a visual instruction, we extract the disentangled ability vectors from inputs of different modalities and merge them into a fused vector that guides the LVLM in generating the output.
  • Figure 3: The impact of different hyperparameters.
  • Figure 4: Efficiency test and scaling test for ViFT.