TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?

Yikun Zong; Cheston Tan

TL;DR

Vision-Language Models (VLMs) struggle with continuous geometric reasoning in Tangram tasks and fail to place pieces at precise coordinates. This work introduces a human-inspired framework that mimics mental rotation and iterative, feedback-driven refinement, together with a test-time verifier-refiner loop that uses in-context learning and reward-guided feedback without retraining. Across five VLMs, IoU on single-piece and two-piece Tangram tasks remains far below human performance, but the proposed loop raises IoU from about 0.63 to 0.93 on medium-triangle cases. The results suggest that integrating geometry-aware feedback into inference can support self-improvement in continuous spatial domains, with possible implications for robotics and embodied AI.

Abstract

Humans excel at spatial reasoning tasks such as Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial and error, observation, and correction, we design a framework that models these cognitive mechanisms. Comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: the average Intersection over Union (IoU) is only 0.41 on single-piece tasks and drops to 0.23 on two-piece composition, far below human performance, given that even children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops. Our training-free verifier-refiner agent applies recursive loops that refine predictions based on geometric consistency feedback, improving IoU from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link: https://anonymous.4open.science/r/TangramVLM-F582/.
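To make the idea of reward-guided refinement concrete, here is a minimal sketch in Python of an IoU score and a greedy refinement loop for a single triangle. It is not the authors' implementation: the paper's refiner is a VLM driven by in-context learning and reward feedback, whereas this sketch swaps in a simple hill-climbing search as a stand-in for the proposal step. The function names (`triangle`, `rasterize`, `iou`, `refine`), the unit-square canvas, the `(cx, cy, angle, size)` pose format, the step sizes, and the use of plain IoU as the reward are all assumptions made for illustration.

```python
import math

GRID = 48  # sampling resolution over an assumed unit-square canvas


def triangle(cx, cy, angle_deg, size):
    """Right isosceles triangle with legs `size`, centroid at (cx, cy), rotated by angle_deg."""
    base = [(-size / 3, -size / 3), (2 * size / 3, -size / 3), (-size / 3, 2 * size / 3)]
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in base]


def inside(poly, px, py):
    """True if (px, py) lies in the convex polygon (all edge cross products share a sign)."""
    sign = 0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross == 0:
            continue  # point is on this edge's line; let the other edges decide
        if sign == 0:
            sign = 1 if cross > 0 else -1
        elif (cross > 0) != (sign > 0):
            return False
    return True


def rasterize(poly):
    """Boolean occupancy mask sampled at the cell centres of a GRID x GRID canvas."""
    return [inside(poly, (i + 0.5) / GRID, (j + 0.5) / GRID)
            for i in range(GRID) for j in range(GRID)]


def iou(mask_a, mask_b):
    """Intersection over Union of two occupancy masks."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0


def refine(pose, target_mask, pos_step=0.08, ang_step=16.0, rounds=25):
    """Greedy local search: keep a perturbed pose only if it raises the IoU reward."""
    best = pose
    best_score = iou(rasterize(triangle(*best)), target_mask)
    for _ in range(rounds):
        cx, cy, ang, size = best
        candidates = [
            (cx + pos_step, cy, ang, size), (cx - pos_step, cy, ang, size),
            (cx, cy + pos_step, ang, size), (cx, cy - pos_step, ang, size),
            (cx, cy, ang + ang_step, size), (cx, cy, ang - ang_step, size),
        ]
        improved = False
        for cand in candidates:
            score = iou(rasterize(triangle(*cand)), target_mask)
            if score > best_score:
                best, best_score, improved = cand, score, True
        if not improved:
            # No perturbation helped: shrink the steps for a finer search.
            pos_step, ang_step = pos_step / 2, ang_step / 2
    return best, best_score


# Hypothetical example: a misplaced, misrotated initial guess for one triangle.
target_mask = rasterize(triangle(0.50, 0.50, 45.0, 0.40))
start = (0.42, 0.56, 20.0, 0.40)
initial_iou = iou(rasterize(triangle(*start)), target_mask)
refined_pose, refined_iou = refine(start, target_mask)
print(f"IoU before: {initial_iou:.3f}, after: {refined_iou:.3f}")
```

Because a candidate pose is accepted only when it strictly increases the reward, the score can never decrease across rounds. That mirrors the verifier role described in the abstract, while the actual proposal mechanism in the paper may differ substantially.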

Paper Structure (20 sections, 1 equation, 3 figures, 5 tables, 3 algorithms)

Introduction
Contributions
Related Work
Methodology
Metrics and Inference Protocol
Dataset and Tasks (Continuity-Space Protocol)
Test-Time Self-Improvement via Reward-Guided Refinement
Reward (what we actually optimize).
Self-refinement loop mechanics (training-free).
Deterministic local refinement.
Results and Analysis
Part I: Cross-Model Comparison on Spatial Reasoning
Part II: Spatial Arrangement (Two-Piece Composition)
Part III: Test-Time Self-Improvement via Reward-Guided Refinement
Findings.
...and 5 more sections

Figures (3)

Figure 1: Overall dataset construction pipeline from SVG $\rightarrow$ JSON $\rightarrow$ PNG $\rightarrow$ split tasks. The diagram shows how raw SVG tangram silhouettes are parsed into JSON annotations (type, position, angle, size), rendered into training/evaluation images, and split into single-piece, two-piece, or full-tangram subsets.
Figure 2: Spatial reasoning tasks: single-piece and two-piece Tangram assembly.
Figure 3: Mean IoU across ablations on the medium triangle. The test-time self-refinement loop (ICL + reward) yields the largest gain.
