CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models

Kaizhen Tan

Abstract

Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS greater than or equal to +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
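
The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compass_distribution`, the choice of eight 45-degree sectors, the convention that 0 degrees points east with angles increasing counterclockwise, and the clipping of negative attribution values are all assumptions made for illustration. The input `attr` stands in for an already aggregated multi-layer attribution map.

```python
import numpy as np

def compass_distribution(attr, ref_center, n_sectors=8):
    """Bin a 2D attribution map into compass sectors around a reference point.

    attr       : 2D array of attribution scores (hypothetical aggregated map).
    ref_center : (x, y) pixel position of the reference object (assumed given).
    Returns a length-n_sectors probability vector; sector k is centered at
    k * 360 / n_sectors degrees (0 = east, counterclockwise; an assumption).
    """
    # Assumption: only positive evidence contributes to the distribution.
    attr = np.clip(np.asarray(attr, dtype=float), 0.0, None)
    h, w = attr.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = ref_center

    # Image rows grow downward, so flip the sign to make "up" positive.
    dx = xs - cx
    dy = cy - ys
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0

    # Shift by half a sector so that each sector is centered on its angle.
    width = 360.0 / n_sectors
    idx = np.floor((theta + width / 2.0) / width).astype(int) % n_sectors

    mass = np.bincount(idx.ravel(), weights=attr.ravel(), minlength=n_sectors)
    total = mass.sum()
    if total == 0.0:
        # No evidence at all: fall back to a uniform distribution.
        return np.full(n_sectors, 1.0 / n_sectors)
    return mass / total
```

With these conventions, attribution mass located directly above the reference object lands in sector 2 (90 degrees), and mass directly to its right lands in sector 0. The pixel at the reference center itself has an undefined angle and is assigned to sector 0 by `arctan2`; a more careful version might mask it out.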

Paper Structure (38 sections, 8 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: CREG pipeline overview. An image with reference/target objects is fed through a VLM. Contrastive Grad$\times$Act signals from multiple layers are aggregated and projected into a reference-centered polar coordinate system, producing the compass distribution $\hat{P}(\theta)$.
  • Figure 2: Intuition behind our evaluation metrics. (a) DAE: angular distance between the compass peak and the true $A \to B$ direction. (b) EA: whether the peak falls within $\pm 45^\circ$ of the truth. (c) COS: occluding the true-direction sector should cause a larger confidence drop than occluding the opposite sector.
  • Figure 3: Compass distribution $\hat{P}(\theta)$ examples. Red arrow = true direction. (a) Strong peak at correct direction. (b) Diffuse: no clear directional signal. (c) Wrong peak direction.
  • Figure 4: Per-class analysis on VSR. (a) VLM accuracy is much higher for horizontal relations than for vertical ones, with the "below" relation at only 30%. (b) DAE shows a similar pattern: vertical relations have higher angular error. Dashed lines indicate random baselines.
  • Figure 5: Success vs. failure illustration. Green arrow = true direction; colored arrow = CREG peak. Left: CREG correctly identifies the right direction (DAE $= 15^\circ$). Right: CREG peak points opposite to the true direction (DAE $= 155^\circ$).
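
The metric intuitions in the Figure 2 caption can be made concrete with a short sketch. It follows the caption's wording only; the paper's exact definitions are not reproduced here. The function names, the use of the arg-max sector center as the "peak", the inclusive 45-degree tolerance, and in particular the form of `occlusion_gap` (a simple difference of confidence drops) are assumptions for illustration.

```python
import numpy as np

def angular_distance(a_deg, b_deg):
    """Smallest absolute angle between two directions, in [0, 180] degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def peak_direction(p):
    """Center angle of the highest-mass sector (assumes evenly spaced sectors)."""
    p = np.asarray(p, dtype=float)
    return float(np.argmax(p)) * 360.0 / len(p)

def direction_alignment_error(p, true_deg):
    """DAE for one sample: angle between the compass peak and the true direction."""
    return angular_distance(peak_direction(p), true_deg)

def edge_hit(p, true_deg, tol_deg=45.0):
    """EA for one sample: True if the peak lies within tol_deg of the truth.
    Averaging this over a dataset gives an accuracy-style score."""
    return direction_alignment_error(p, true_deg) <= tol_deg

def occlusion_gap(conf_full, conf_occ_true, conf_occ_opposite):
    """Hypothetical COS-style score: confidence drop from occluding the
    true-direction sector minus the drop from occluding the opposite sector.
    Positive values mean the true-direction sector mattered more."""
    drop_true = conf_full - conf_occ_true
    drop_opposite = conf_full - conf_occ_opposite
    return drop_true - drop_opposite
```

For example, with eight sectors and all mass in the sector centered at 90 degrees, a true direction of 75 degrees gives a DAE of 15 degrees and counts as an edge hit, while a true direction of 270 degrees gives a DAE of 180 degrees and a miss.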