Caption This, Reason That: VLMs Caught in the Middle

Zihan Weng, Lucas Gomez, Taylor Whittington Webb, Pouya Bashivan

TL;DR

This work probes why state-of-the-art vision-language models (VLMs) fail on visual reasoning by evaluating them along three core cognitive axes: Perception, Attention, and Memory (PAM). It develops procedurally generated PAM and composite visual reasoning (CVR) tasks via the iWISDM environment and introduces vision-text decoupling methods (PC/SC/SC-I) plus LoRA fine-tuning to boost reasoning. Key findings show strong category perception but persistent spatial and attention bottlenecks, with PAM performance strongly predicting CVR success. Decoupling and targeted fine-tuning yield robust gains on these tasks, and performance on them correlates strongly with broader benchmarks like MMMU-Pro and VQAv2, although fine-tuning does not bring large improvements on those out-of-distribution benchmarks. The results offer practical strategies to enhance VLM reasoning and provide a framework for linking core cognitive abilities to complex visual reasoning, informing better design and evaluation of future models.

Abstract

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs substantially improves core cognitive abilities. While this improvement does not translate to large enhancements on challenging, out-of-distribution benchmarks, we show broadly that VLM performance on our datasets strongly correlates with performance on these other benchmarks. Our work provides a detailed analysis of VLM cognitive strengths and weaknesses and identifies key bottlenecks in simultaneous perception and reasoning while also providing an effective and simple solution.
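The vision-text decoupling idea described in the abstract (caption first, then reason over the captions alone) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `decoupled_answer`, `caption_fn`, and `reason_fn` are hypothetical, the prompt layout is an assumption, and the toy stand-ins below replace real model calls so that the example runs on its own.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not from the paper):
Captioner = Callable[[dict], str]   # one image frame -> text caption
Reasoner = Callable[[str], str]     # text-only prompt -> answer


def decoupled_answer(frames: Sequence[dict], question: str,
                     caption_fn: Captioner, reason_fn: Reasoner) -> str:
    """Caption every frame, then answer the question from the captions only."""
    # Stage 1 (perception): turn each frame into text.
    captions = [caption_fn(frame) for frame in frames]
    # Stage 2 (reasoning): the reasoner never sees the images, only text.
    context = "\n".join(f"Frame {i + 1}: {c}" for i, c in enumerate(captions))
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    return reason_fn(prompt)


def toy_caption(frame: dict) -> str:
    """Stand-in for a VLM captioner; frames here are plain dicts."""
    return f"a {frame['category']} at the {frame['location']}"


def toy_reason(prompt: str) -> str:
    """Stand-in for a text-only reasoner: compares captioned categories."""
    lines = [ln for ln in prompt.splitlines() if ln.startswith("Frame ")]
    categories = [ln.split(": a ")[1].split(" at the ")[0] for ln in lines]
    return "yes" if len(set(categories)) == 1 else "no"


frames = [
    {"category": "car", "location": "top left"},
    {"category": "car", "location": "bottom right"},
]
answer = decoupled_answer(
    frames, "Are the two objects from the same category?",
    toy_caption, toy_reason,
)
print(answer)
```

In a real setting, `caption_fn` and `reason_fn` would wrap model calls. Using the same VLM for both stages would correspond to the abstract's setting of a model reasoning over its own generated captions.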

Paper Structure

This paper contains 31 sections, 16 figures, 17 tables.

Figures (16)

  • Figure 1: PAM Dataset. a) Relationship between different cognitive abilities underlying reasoning. b) Different components of the PAM dataset and example tasks from each. Green and blue correspond to Category and Location tasks, respectively. A complete list of examples is provided in the Appendix task-example figures, from the first task-example figure through the CVR task examples.
  • Figure 2: Scatter plots comparing average PAM task performance against average CVR task performance across all models. Each point represents a different model or a LoRA fine-tuned version of Qwen2.5-VL-7B. The x-axis shows the average accuracy on the specified PAM task category (averaged over Loc and Cat), and the y-axis shows the CVR accuracy score.
  • Figure A.1.1: Example trials from Perception Category (Perc-Cat-R & Perc-Cat-C) and Localization (Perc-Loc-R & Perc-Loc-C) tasks. Each task consists of two variations: Report, where the agent is tasked with reporting the object's property, and Compare, where the agent is tasked with comparing that property between two objects on separate frames.
  • Figure A.1.2: Example trials from Spatial (Att-Spa-R & Att-Spa-C) and Feature Attention (Att-Feat-R & Att-Feat-C) tasks. Each task consists of two variations: Report, where the agent is tasked with reporting the object's property, and Compare, where the agent is tasked with comparing that property between two objects on separate frames.
  • Figure A.1.3: Example trials from Memory with Distractors Category (Mem-Dis-Cat-R & Mem-Dis-Cat-C) and Memory with Distractors Location (Mem-Dis-Loc-R & Mem-Dis-Loc-C) tasks. Each task consists of two variations: Report, where the agent is tasked with reporting the object's property, and Compare, where the agent is tasked with comparing that property between two objects on separate frames.
  • ...and 11 more figures