Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Nikos Theodoridis, Reenu Mohandas, Ganesh Sistu, Anthony Scanlan, Ciarán Eising, Tim Brophy

TL;DR

Increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. The findings improve the understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.

Abstract

The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long-tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
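
The probing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the activations below are synthetic stand-ins, and the function names, the `shift` parameter (how strongly the concept moves the activation along one direction), the logistic-regression probe, and the 80/20 split by pair are all assumptions made for illustration. The sketch trains a linear probe to separate the two members of each counterfactual pair and reports test-set accuracy as a proxy for linear separability.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, epochs=300):
    """Fit a logistic-regression probe by full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(concept present)
        err = p - y                              # gradient of the log-loss w.r.t. the logits
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def probe_accuracy(n_pairs=200, dim=32, shift=6.0, seed=0):
    """Test accuracy of one probe on synthetic counterfactual 'activations'."""
    rng = np.random.default_rng(seed)
    scene = rng.normal(size=(n_pairs, dim))      # content shared by both images of a pair
    concept = np.zeros(dim)
    concept[0] = 1.0                             # assumed concept direction (hypothetical)
    X = np.vstack([scene, scene + shift * concept])  # rows: without, then with the concept
    X += 0.1 * rng.normal(size=X.shape)          # small per-image noise
    y = np.concatenate([np.zeros(n_pairs), np.ones(n_pairs)])

    # Split by pair so both members of a pair fall on the same side of the split.
    order = rng.permutation(n_pairs)
    n_train = int(0.8 * n_pairs)
    train = np.concatenate([order[:n_train], order[:n_train] + n_pairs])
    test = np.concatenate([order[n_train:], order[n_train:] + n_pairs])

    # Standardise with training statistics only.
    mu, sigma = X[train].mean(axis=0), X[train].std(axis=0) + 1e-8
    w, b = train_linear_probe((X[train] - mu) / sigma, y[train])
    logits = ((X[test] - mu) / sigma) @ w + b
    return float(((logits > 0) == (y[test] == 1)).mean())

if __name__ == "__main__":
    # Average over ten probes trained under the same setting, with different seeds.
    print(np.mean([probe_accuracy(seed=s) for s in range(10)]))
```

In this toy setup, setting `shift=0.0` removes the concept signal, so the probe should fall to roughly chance level. With real models, the rows of `X` would instead be pooled intermediate activations extracted from a chosen layer.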

Paper Structure (33 sections, 7 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Framework for tracking visual concept representations across VLM architectures. Left: The model receives counterfactual input pairs that differ only by a specific visual concept (e.g., the presence of a pedestrian). Middle: Linear probes are trained on intermediate activations to detect if the concept is linearly encoded within the Vision Encoder, Projector, or LLM. Right: Comparing probe accuracy to model output reveals two distinct failure modes: Perceptual Failure, where visual information is not linearly encoded in the activations, and Cognitive Failure, where the information is encoded (high probe accuracy) but the model fails to align it with language semantics, resulting in an incorrect answer.
  • Figure 2: Counterfactual sets of images representing four basic visual concepts: Presence, Count, Spatial Relationship, Orientation.
  • Figure 3: Activation extraction methodology. a) We apply average pooling to all patch vectors to obtain a single vector representation of the intermediate activations. b) We split the image at a selected point and apply average pooling to the left and right regions independently. We then concatenate the resulting vectors to form a single representation of the intermediate activations while retaining minimal spatial structure. c) We apply average pooling to the visual token vectors within the and concatenate the result with the activation of the last token.
  • Figure 4: Linear separability of visual concepts across models and layers based on average-pooled activations. Linear separability for each model, layer, and distance is measured as the average test-set accuracy across ten linear probes trained under the same setting.
  • Figure 5: Linear separability of visual concepts across models and layers based on region-pooled activations. Linear separability for each model, layer, and distance is measured as the average test-set accuracy across ten linear probes trained under the same setting.
  • ...and 2 more figures
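
The three extraction strategies described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the array layouts (a `(rows, cols, dim)` patch grid for the vision encoder, a `(tokens, dim)` sequence for the LLM), the `split_col` argument, the boolean `visual_mask`, and all function names are assumptions.

```python
import numpy as np

def average_pool(patches):
    """(a) Mean over all patch vectors: (rows, cols, dim) -> (dim,)."""
    return patches.reshape(-1, patches.shape[-1]).mean(axis=0)

def region_pool(patches, split_col):
    """(b) Mean-pool the left and right regions separately, then concatenate -> (2 * dim,)."""
    dim = patches.shape[-1]
    left = patches[:, :split_col].reshape(-1, dim).mean(axis=0)
    right = patches[:, split_col:].reshape(-1, dim).mean(axis=0)
    return np.concatenate([left, right])

def visual_plus_last_token(tokens, visual_mask):
    """(c) Mean over visual-token vectors, concatenated with the last token's vector."""
    return np.concatenate([tokens[visual_mask].mean(axis=0), tokens[-1]])

if __name__ == "__main__":
    # Toy grid: left half all zeros, right half all ones, plus its mirror image.
    patches = np.zeros((4, 4, 3))
    patches[:, 2:] = 1.0
    mirrored = patches[:, ::-1]
    # Average pooling maps both images to the same vector ...
    print(np.allclose(average_pool(patches), average_pool(mirrored)))
    # ... while region pooling keeps the left/right difference.
    print(np.allclose(region_pool(patches, 2), region_pool(mirrored, 2)))
```

The toy example shows why the choice of pooling matters for spatial concepts: global average pooling discards where a feature occurs, so an image and its mirror become indistinguishable, whereas pooling two regions separately retains a minimal left/right structure that a linear probe can use.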