From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi

TL;DR

VLM-Lens addresses the need for systematic probing and interpretation of vision-language models by enabling extraction of intermediate representations across layers through a unified YAML-configurable interface. It supports 16 base VLMs and 30+ variants, with per-model environment setups and a SQL-based database to organize outputs for flexible analysis. The authors demonstrate two analyses—a probing framework for primitive concept competence and a Stroop-like color grounding task—showing that hidden representations encode task-relevant information and reveal layer- and model-dependent differences. Inference and memory benchmarking across models highlight practical trade-offs for deployment, underscoring the toolkit’s value for rigorous, beyond-accuracy evaluation and iterative improvement of multimodal systems.
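To make the idea of a SQL-based store for extracted outputs concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not VLM-Lens's actual schema or API: the table name `activations`, its columns, and the helpers `open_store`, `save_activation`, and `load_layer` are all hypothetical, chosen only to show how per-layer representations might be keyed by model, layer, and input for later analysis.

```python
import sqlite3
from array import array


def open_store(path=":memory:"):
    """Open (or create) a hypothetical activation store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS activations (
               model    TEXT NOT NULL,
               layer    TEXT NOT NULL,
               input_id TEXT NOT NULL,
               dim      INTEGER NOT NULL,
               vector   BLOB NOT NULL,
               PRIMARY KEY (model, layer, input_id)
           )"""
    )
    return conn


def save_activation(conn, model, layer, input_id, values):
    """Store one representation as a float32 blob (assumed flat vector)."""
    blob = array("f", values).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO activations VALUES (?, ?, ?, ?, ?)",
        (model, layer, input_id, len(values), blob),
    )
    conn.commit()


def load_layer(conn, model, layer):
    """Return {input_id: vector} for every input stored for one layer."""
    rows = conn.execute(
        "SELECT input_id, vector FROM activations "
        "WHERE model = ? AND layer = ? ORDER BY input_id",
        (model, layer),
    )
    out = {}
    for input_id, blob in rows:
        vec = array("f")
        vec.frombytes(blob)
        out[input_id] = list(vec)
    return out
```

Keying rows by (model, layer, input) means a probing experiment can pull all representations for one layer with a single query, which is the kind of flexible downstream analysis the summary describes.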

Abstract

We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and more than 30 of their variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-source project to accelerate community efforts in understanding and improving VLMs.

Paper Structure

This paper contains 15 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An example use case of VLM-Lens, where intermediate output from Qwen2-VL (Wang et al., 2024) is extracted for probing.
  • Figure 2: Evaluation Accuracy on our probing dataset by model, layer, and split. Main refers to probing on the regular data, while control stands for probing using data with random labels. The number of asterisks represents the significance level of the Z-test for Bernoulli variables (***: $p=.001$, **: $p=.01$, *: $p=.05$).
  • Figure 3: Cosine similarity between Stroop task images and primitive color concepts. Results are shown as a function of model layer (x-axis) and number of PCA components retained (y-axis), with orange surfaces indicating matching conditions and blue surfaces indicating mismatching conditions when considering different aspects.
  • Figure 4: Example images used in the Stroop Task.
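
The Figure 2 caption refers to a Z-test for Bernoulli variables when comparing main and control probing accuracy. The exact test statistic is not given here, so the following is only a sketch of one common form, the pooled two-proportion z-test with a two-sided p-value. The function names are hypothetical, and reading the caption's asterisk levels as upper thresholds on the p-value is an assumption.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-test (assumed form, not confirmed by the source).

    Returns (z, two-sided p-value) for the difference in accuracy between
    condition A (e.g. main) and condition B (e.g. control).
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; accuracies are degenerate")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def stars(p_value):
    """Map a p-value to asterisks, treating the caption's levels as thresholds."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""
```

For example, `two_proportion_z_test(90, 100, 50, 100)` compares a probe that gets 90 of 100 examples right against a control that gets 50 of 100; the counts are made up for illustration and are not results from the paper.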