Understanding Task Transfer in Vision-Language Models

Bhuvan Sachdeva, Karan Uppal, Abhinav Java, Vineeth N. Balasubramanian

TL;DR

The paper investigates how finetuning a vision-language model (VLM) on a single perception task affects zero-shot performance on other perception tasks. It introduces the Perfection Gap Factor (PGF), a ceiling-aware, normalized measure of transfer, and systematically studies cross-task transfer across three model scales and 13 BLINK perception tasks. The analysis reveals structured transfer, including positive and negative patterns, task cliques, and personas; positive transfer grows with model size and extends to spatio-temporal tasks. PGF-guided data selection yields practical gains in finetuning efficiency, sometimes outperforming direct target-task supervision, and offers actionable guidance for robust and economical VLM adaptation.

Abstract

Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce the Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior, and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.
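
The text here describes PGF only as a ceiling-aware, normalized measure of transfer and does not give its formula. To make the idea concrete, the sketch below computes a generic "fraction of the gap to a perfect score that was closed" for each (source task, target task) pair. This formula is an assumption suggested by the metric's name, not the paper's definition. The function names, the 100-point ceiling, the task names, and all accuracy values are hypothetical and purely illustrative.

```python
from typing import Dict


def gap_closed_fraction(base_acc: float, finetuned_acc: float,
                        ceiling: float = 100.0) -> float:
    """Assumed ceiling-aware score: change in accuracy divided by the
    headroom that the base model had left below the ceiling.

    Positive values mean the finetuned model moved toward the ceiling,
    negative values mean it moved away. This is an illustrative
    stand-in, not the paper's PGF formula.
    """
    headroom = ceiling - base_acc
    if headroom <= 0:
        raise ValueError("base accuracy must be strictly below the ceiling")
    return (finetuned_acc - base_acc) / headroom


def transfer_matrix(base: Dict[str, float],
                    finetuned: Dict[str, Dict[str, float]],
                    ceiling: float = 100.0) -> Dict[str, Dict[str, float]]:
    """Build {source_task: {target_task: score}} from accuracies.

    base[target] is the zero-shot accuracy of the base model;
    finetuned[source][target] is the accuracy on `target` after
    finetuning on `source` only.
    """
    return {
        source: {
            target: gap_closed_fraction(base[target], acc, ceiling)
            for target, acc in row.items()
        }
        for source, row in finetuned.items()
    }


# Toy, made-up accuracies (percent); not results from the paper.
base = {"depth": 60.0, "counting": 50.0}
finetuned = {"depth": {"depth": 80.0, "counting": 45.0}}
scores = transfer_matrix(base, finetuned)
print(scores)
```

Normalizing by headroom rather than by raw accuracy difference keeps a 5-point gain on a nearly saturated task from being treated the same as a 5-point gain on a task with plenty of room to improve, which is the sense of "ceiling-aware" assumed in this sketch.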

Paper Structure

This paper contains 24 sections, 3 equations, 39 figures, 4 tables, 1 algorithm.

Figures (39)

  • Figure 1: One finetune, many fates: Finetuning Qwen-2.5-VL 32B on perception tasks creates a structured map of transfer capabilities. (The perception tasks considered are listed in the paper's task classification table.)
  • Figure 2: PGF Heatmaps for Qwen-2.5-VL model family (3B, 7B, 32B).
  • Figure 3: Average positive malleability trends across granular and perceptual levels. We observe that positive malleability increases with model size and that low-level tasks generally benefit the most from finetuning. Detailed category-wise heatmaps are provided in the supplementary material.
  • Figure 4: Average positive transferability trends across granular and perceptual levels. We observe that positive transferability increases with model size and that low-level and image-level tasks are generally highly transferable. Detailed category-wise heatmaps are provided in the supplementary material.
  • Figure 5: Task transferability trends across model sizes in Qwen-2.5-VL. As expected, average positive transferability increases with model size.
  • ...and 34 more figures

Theorems & Definitions (3)

  • Definition 1: Task Transferability
  • Definition 2: Malleability
  • Definition 3: Task Clique