Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

Amartya Bhattacharya

Abstract

Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning: distinguishing captions that share the same words but differ in relational structure. We present a unified evaluation and augmentation framework that benchmarks four architecturally diverse VLMs (CLIP, BLIP, LLaVA, and Qwen3-VL-8B-Thinking) on the Winoground benchmark under plain and scene-graph (SG) augmented regimes. We introduce a dependency-based TextSceneGraphParser (spaCy) that extracts subject-relation-object triples, and a Graph Asymmetry Scorer that uses optimal bipartite matching to inject structural relational priors. Caption ablation experiments (subject-object masking and swapping) reveal that Qwen3-VL-8B-Thinking achieves a group score of 62.75, far above all encoder-based models, while a proposed multi-turn SG filtering strategy further lifts it to 66.0, surpassing the prior open-source state of the art. We analyze the capability-augmentation tradeoff and find that SG augmentation benefits already capable models while providing negligible or negative gains for weaker baselines. Code: https://github.com/amartyacodes/Inference-Time-Structural-Reasoning-for-Compositional-Vision-Language-Understanding
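The abstract describes the Graph Asymmetry Scorer only at a high level. As a rough illustration, here is a minimal sketch of scoring via optimal bipartite matching over triples, using scipy's linear_sum_assignment; the triple-similarity function, the function names, and the final score definition are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a graph-asymmetry score between two captions'
# scene graphs; the similarity measure and the score definition are
# assumptions, not the paper's exact method.
from scipy.optimize import linear_sum_assignment
import numpy as np

Triple = tuple[str, str, str]  # (subject, relation, object)

def triple_similarity(a: Triple, b: Triple) -> float:
    """Fraction of slots (subject, relation, object) that match exactly."""
    return sum(x == y for x, y in zip(a, b)) / 3.0

def graph_asymmetry(g1: list[Triple], g2: list[Triple]) -> float:
    """1 minus the mean similarity of optimally matched triple pairs.

    0 means the two graphs align perfectly; values near 1 mean the
    captions share little relational structure. Unmatched surplus
    triples are simply ignored in this simplified sketch.
    """
    if not g1 or not g2:
        return 1.0
    sim = np.array([[triple_similarity(a, b) for b in g2] for a in g1])
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return 1.0 - sim[rows, cols].mean()

# Winoground-style pair: same words, swapped roles.
g_a = [("dog", "chases", "cat")]
g_b = [("cat", "chases", "dog")]
print(graph_asymmetry(g_a, g_b))  # ~0.667: only the relation slot matches
```

A score like this makes the role swap explicit: the two captions share every word, yet two of the three slots in their single triples disagree, which is exactly the structural signal the augmented regime is meant to expose.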

Figures (5)

  • Figure 1: Winoground examples. Captions share the same words arranged differently; images share the same visual elements interacting differently. Models must correctly pair each caption with its image.
  • Figure 2: TextSceneGraphParser architecture (spaCy rule-based). Five parallel extraction rules produce triples that are deduplicated and cached per caption string (a sketch of the core extraction pattern follows this list).
  • Figure 3: Caption ablation pipeline. spaCy identifies semantic spans offline; captions are transformed (mask/swap subject, object, or both) and evaluated without SG injection.
  • Figure 4: Qwen3-VL-Thinking pipeline with multi-turn SG filtering. Turn 1 identifies visually relevant SG relations; Turn 2 uses those filtered triples for the final match decision.
  • Figure 5: Scoring mechanism: last-token logits for "yes"/"no" are softmax-normalized to yield $P(\text{yes}\mid I, c)$ (sketched after this list).
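Figure 2 describes the TextSceneGraphParser at block-diagram level. Below is a minimal sketch of one plausible extraction pattern (nsubj-VERB-dobj) using spaCy's dependency parse; the pattern shown is an assumption, and the paper's five parallel rules and per-caption caching are not reproduced.

```python
# Minimal sketch of dependency-based (subject, relation, object) extraction
# with spaCy, covering only the basic nsubj-VERB-dobj pattern; the paper's
# parser uses five rules plus caching, which are omitted here.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(caption: str) -> list[tuple[str, str, str]]:
    triples = []
    for token in nlp(caption):
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        for s in subjects:
            for o in objects:
                triples.append((s.lemma_, token.lemma_, o.lemma_))
    return sorted(set(triples))  # deduplicate, as in Figure 2

print(extract_triples("an old person kisses a young person"))
# [('person', 'kiss', 'person')]
```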
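Figure 5's scoring rule can be written down directly. The sketch below assumes a generic HuggingFace-style causal VLM interface (model, processor) and a hypothetical prompt string; only the restricted softmax over the "yes"/"no" logits follows the figure.

```python
# Sketch of Figure 5's scoring rule: P(yes | I, c) from the last-position
# logits of "yes" vs "no". The model/processor interface and prompt are
# generic HuggingFace-style assumptions, not the paper's exact code.
import torch

@torch.no_grad()
def p_yes(model, processor, image, caption: str) -> float:
    prompt = f'Does this image match the caption "{caption}"? Answer yes or no.'
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]  # next-token logits
    # Note: some tokenizers encode the answer as a subword piece such as
    # "▁yes"; a real implementation should resolve the correct token ids.
    yes_id = processor.tokenizer.convert_tokens_to_ids("yes")
    no_id = processor.tokenizer.convert_tokens_to_ids("no")
    # Softmax restricted to the two answer tokens yields P(yes | I, c).
    pair = torch.stack([logits[yes_id], logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()
```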