Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram

Abstract

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotation, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.

Paper Structure

This paper contains 37 sections, 17 figures, and 15 tables.

Figures (17)

  • Figure 1: Failure of visual transformation reasoning across visual domains. Given a pair of images, models are asked to determine whether they depict the same object under transformations of rotation, scale, or identity. While performance remains near-perfect on natural images (Art, Photo), accuracy drops sharply on abstract and symbolic images (Symbolic and Semantic Sketches), particularly for rotation. Results shown are for Gemini-2.5-Pro [gemini], with similar trends across evaluated MLLMs. (A minimal sketch of this pairwise query setup appears after this figure list.)
  • Figure 2: Cosine similarity between features extracted from different vision encoders on pairs of images under rotation. Selected Omniglot scripts are shown in orange, while Times New Roman and Handwritten English are shown in blue and purple respectively. Across all encoders, similarity decreases with increasing rotation angle, with DINOv2 showing the steepest drop and SigLIP and Qwen2.5-VL-7B maintaining relatively higher similarity. (A feature-similarity sketch also follows the figure list.)
  • Figure 3: Failure cases on the identity task (Sec. \ref{sec:identity_exp}) for Qwen2.5-VL-7B. We show four randomly selected examples from the Omniglot dataset where the model incorrectly predicts that two identical inputs correspond to different characters.
  • Figure 4: Datasets used in our evaluation. Omniglot [omniglot] contains handwritten binary characters from 50 diverse scripts. Times New Roman [timesnewroman] provides standardized English characters rendered in a fixed typeface. Handwritten English [handwritten_english_characters_digits] includes handwritten characters from the English alphabet. PACS [PACS] contains images of common object categories (e.g., guitar, dog, elephant) across four visual domains: Photograph, Art, Cartoon, and Sketch. Together, these datasets allow us to evaluate transformation invariance in MLLMs across scripts, visual styles, and images with varying levels of semantic richness.
  • Figure 5: Examples from the Omniglot dataset
  • ...and 12 more figures
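
The pairwise evaluation described in Figure 1 can be approximated with a short script. The following is a minimal sketch under stated assumptions, not the authors' exact protocol: the prompt wording, the 90° rotation angle, the 2x down-scaling factor, and the file name omniglot_character.png are illustrative placeholders, and the final call to a specific VLM is omitted because it depends on the model's own multimodal API.

```python
from PIL import Image

# Illustrative prompt; the paper's exact wording may differ.
PROMPT = (
    "You are shown two images. Do they depict the same object or character, "
    "possibly rotated or rescaled? Answer 'yes' or 'no'."
)

def make_pair(path, transform):
    """Return an (original, transformed) image pair for one evaluation trial."""
    img = Image.open(path).convert("RGB")
    if transform == "identity":
        other = img.copy()                        # exact duplicate of the input
    elif transform == "rotation":
        other = img.rotate(90, expand=True, fillcolor=(255, 255, 255))
    elif transform == "scale":
        w, h = img.size
        other = img.resize((max(1, w // 2), max(1, h // 2)))  # uniform down-scaling
    else:
        raise ValueError(f"unknown transform: {transform}")
    return img, other

original, transformed = make_pair("omniglot_character.png", "rotation")
# `original`, `transformed`, and PROMPT would then be sent to the VLM under
# evaluation (e.g., Gemini-2.5-Pro or Qwen2.5-VL) through its multimodal API.
```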
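
Likewise, the encoder-level probe behind Figure 2 amounts to comparing an image's features with those of its rotated copy. The sketch below is an assumption-laden illustration rather than the authors' pipeline: the facebook/dinov2-base checkpoint, mean pooling over patch tokens, the white background fill, and the file name omniglot_character.png are all choices made here for concreteness; any vision encoder exposing per-patch features could be substituted.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms import functional as TF
from transformers import AutoImageProcessor, AutoModel

# DINOv2 is one of the encoders named in Figure 2; any Hugging Face vision
# encoder that exposes `last_hidden_state` could be used instead.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
encoder = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def embed(img):
    """Mean-pooled patch features for one PIL image."""
    inputs = processor(images=img, return_tensors="pt")
    return encoder(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

def rotation_similarity(img, angle):
    """Cosine similarity between an image and its rotated copy."""
    rotated = TF.rotate(img, angle, fill=[255, 255, 255])  # white fill for binary sketches
    return F.cosine_similarity(embed(img), embed(rotated), dim=0).item()

img = Image.open("omniglot_character.png").convert("RGB")
for angle in (0, 45, 90, 135, 180):
    print(f"{angle:3d} deg: {rotation_similarity(img, angle):.3f}")
```

Sweeping the rotation angle in this way reproduces the qualitative trend the figure reports: similarity between the original and rotated features decays as the angle grows, with the rate of decay depending on the encoder.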