Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities

Shivam Chandhok, Wan-Cyuan Fan, Leonid Sigal

TL;DR

The paper addresses the gap between strong end-task performance of vision-language models and their ability to perform fundamental visual tasks. It introduces a three-space diagnostic framework—visual latent, vision-language projection latent, and language response space—and uses frozen feature probes and VQA-style prompts across multiple open-source models and diagnostic datasets to locate bottlenecks. Key findings show that fine-grained recognition and spatial understanding are bottlenecked mainly by the projection from latent spaces to the language decoder and by the visual encoder’s limitations, while counting benefits from latent representations but falters in the final response. These insights point to targeted improvements in projection layers and visual encoders, guiding future VLM development toward more robust and interpretable multimodal understanding.

Abstract

Vision-Language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable but, at the same time, to lack some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks: object classification, understanding spatial arrangement, and the ability to delineate individual object instances (through counting), by constructing a series of tests that probe which components of the design, specifically, may be lacking. Importantly, we go significantly beyond current benchmarks, which simply measure the final performance of a VLM, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder (image embeddings), as well as the intermediate vision-language projection used to bridge the image-encoder and LLM-decoder output in many SoTA models (e.g., LLaVA, BLIP, InstructBLIP). In doing so, we uncover nascent shortcomings in VLM responses and make a number of important observations that could help train and develop more effective VLMs in the future.
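
To make the idea of a probe trained on frozen features concrete, the following is a minimal sketch, not the authors' implementation. It fits a softmax linear classifier on fixed feature vectors and reports held-out accuracy separately for each feature space. Everything here is an assumption made for illustration: the features are synthetic (class centroids plus Gaussian noise), the noise levels are arbitrary and imply nothing about real models, and the names `train_linear_probe`, `probe_accuracy`, `make_synthetic_space`, and `compare_spaces` are hypothetical. In practice the synthetic arrays would be replaced by embeddings extracted from a frozen visual encoder and from the vision-language projection.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=300):
    """Fit a softmax linear classifier on fixed features with gradient descent."""
    n, dim = features.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples whose arg-max prediction matches the label."""
    preds = (features @ weights + bias).argmax(axis=1)
    return float((preds == labels).mean())


def make_synthetic_space(rng, labels, dim, num_classes, noise):
    """Hypothetical stand-in for frozen embeddings: class centroid plus noise."""
    centroids = rng.normal(size=(num_classes, dim))
    return centroids[labels] + noise * rng.normal(size=(len(labels), dim))


def compare_spaces(seed=0, n=600, dim=16, num_classes=3):
    """Train one probe per feature space and return held-out accuracies."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=n)
    # Noise levels are arbitrary illustration values, not measurements.
    spaces = {
        "visual_latent": make_synthetic_space(rng, labels, dim, num_classes, 0.5),
        "projection_latent": make_synthetic_space(rng, labels, dim, num_classes, 3.0),
    }
    split = (2 * n) // 3
    results = {}
    for name, feats in spaces.items():
        train_x, test_x = feats[:split], feats[split:]
        # Standardize with training statistics only; the features stay frozen.
        mean, std = train_x.mean(axis=0), train_x.std(axis=0) + 1e-8
        train_x, test_x = (train_x - mean) / std, (test_x - mean) / std
        weights, bias = train_linear_probe(train_x, labels[:split], num_classes)
        results[name] = probe_accuracy(weights, bias, test_x, labels[split:])
    return results


for space, acc in compare_spaces().items():
    print(f"{space}: probe accuracy = {acc:.3f}")
```

Comparing the probe accuracies against the accuracy of the model's generated answers on the same examples is what would indicate whether information is present in a latent space but lost by the time a response is produced.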

Paper Structure (9 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Unlike previous work, which analyses VLMs as a whole on a given task (left), we propose to look at the performance of VLMs in terms of the intermediate spaces that represent knowledge as it is processed through the VLM network.
  • Figure 2: Overview of our proposed approach for analysing the visual understanding capabilities of VLMs. Specifically, we analyse the three spaces within a VLM, i.e., the visual latent, text latent, and response spaces, to get a nuanced understanding of what aspects of visual information are captured within VLMs, how visual knowledge flows through the network, and where improvements should be targeted to alleviate the deficiencies of VLMs.