Seeing Beyond Redundancy: Task Complexity's Role in Vision Token Specialization in VLLMs

Darryl Hannan, John Cooper, Dylan White, Yijing Watkins

TL;DR

This work investigates why vision capabilities in vision-language models lag behind language capabilities by introducing a synthetic benchmark and a suite of metrics to quantify visual redundancy. Through zero-shot and fine-tuning experiments on Molmo and Llama 3.2, the authors reveal that visual information is broadly distributed across tokens, yielding high redundancy that impedes performance on complex tasks; task complexity, however, correlates with reduced compressibility and the need for more specialized representations. They deploy SVD-based alignment, probing, and ablation analyses to map how information flows across unimodal and multimodal spaces and how fine-tuning reshapes these representations. The findings show that fine-tuning mainly shifts text and multimodal subspaces, with the type of downstream complexity (grounding vs. spatial reasoning) shaping the directional changes, offering practical guidance for training data and compression strategies for next-generation VLLMs.

Abstract

Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. Then, we explore fine-tuning VLLMs on a number of complex visual tasks to better understand how redundancy and compression change based upon the complexity of the data that a model is trained on. We find that there is a connection between task complexity and visual compression, implying that having a sufficient ratio of high complexity visual data is crucial for altering the way that VLLMs distribute their visual representation and consequently improving their performance on complex visual tasks. We hope that this work will provide valuable insights for training the next generation of VLLMs.

Paper Structure

This paper contains 35 sections, 7 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Visual compression trends across layers of Molmo for our proposed synthetic dataset.
  • Figure 2: Spearman correlation between various compression metrics and various visual attributes across layers of Molmo for our proposed synthetic dataset.
  • Figure 3: Spearman correlation between various compression metrics and various visual attributes across layers of Molmo for COCO.
  • Figure 4: Linear probe performance on various visual attributes using Molmo on our synthetic dataset.
  • Figure 5: Linear probe performance on various visual attributes using Molmo on COCO.
  • ...and 5 more figures
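
Figures 2 and 3 report Spearman correlations between compression metrics and visual attributes. As a reminder of what that statistic measures, here is a minimal standard-library sketch: Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the sample numbers are illustrative assumptions and are not taken from the paper; the authors' actual metrics and data are not shown in this excerpt.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-image values, invented for illustration only.
    compression_score = [0.91, 0.85, 0.80, 0.62, 0.55]
    object_count = [1, 2, 2, 5, 8]
    print(round(spearman(compression_score, object_count), 3))
```

Because only ranks matter, the statistic captures any monotonic relationship, not just a linear one, which suits comparisons between metrics and attributes that live on unrelated scales.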