Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah
TL;DR
The paper addresses the binding problem in LVLMs: because visual features are processed largely in parallel, models misbind features to objects, which degrades counting, visual search, scene description, and spatial reasoning. It introduces Visual Input Structure for Enhanced Reasoning (VISER), which augments images with simple horizontal lines and pairs them with a sequential scanning prompt to promote region-wise, serial parsing. Across synthetic and real-world benchmarks and several LVLMs, VISER yields substantial gains on core visual reasoning tasks; for GPT-4o, it improves visual search by 25.0%, counting by 26.8%, and spatial relationship understanding by 9.5%, and reduces scene-description edit distance by 0.32 on 2D datasets. Purely textual strategies such as Chain-of-Thought (CoT) prompting are insufficient on their own and can even degrade performance. The results underscore the importance of visual input design for binding and reasoning, and point to adaptive scaffolds and integrated spatial attention as directions for future work on mitigating binding errors.
Abstract
Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel and lack mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks using only single-query inference. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25.0%, 26.8%, and 9.5%, respectively, and reduces edit distance error in scene description by 0.32 on 2D datasets. Furthermore, we find that the visual modification is essential for these gains: purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER underscores the importance of visual input design over purely linguistic reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.
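To make the idea of visual structuring concrete, the following is a minimal sketch of how an image might be augmented with horizontal lines and paired with a sequential scanning prompt. It is an illustration under stated assumptions, not the authors' implementation: the function names (`line_rows`, `add_horizontal_lines`, `scanning_prompt`), the number of regions, the line color and thickness, and the prompt wording are all hypothetical. The image is represented as a nested list of RGB tuples so the sketch needs no external libraries; in practice an imaging library would likely be used.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def line_rows(height: int, num_regions: int) -> List[int]:
    """Row indices of the dividers that split the image into equal horizontal bands."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    return [(k * height) // num_regions for k in range(1, num_regions)]


def add_horizontal_lines(
    image: Image,
    num_regions: int = 4,            # assumed value, not taken from the paper
    color: Pixel = (0, 0, 0),        # assumed line color
    thickness: int = 1,              # assumed line thickness in pixels
) -> Image:
    """Return a copy of the image with horizontal divider lines drawn on it."""
    out = [list(row) for row in image]  # copy so the input is left untouched
    height = len(out)
    for y in line_rows(height, num_regions):
        for dy in range(thickness):
            row = y + dy
            if 0 <= row < height:
                out[row] = [color] * len(out[row])
    return out


def scanning_prompt(question: str, num_regions: int = 4) -> str:
    """Build a hypothetical prompt that asks for region-by-region, serial parsing."""
    return (
        f"The image is divided into {num_regions} horizontal regions by lines. "
        "Scan the regions one at a time from top to bottom, noting the relevant "
        "objects in each region before moving on. "
        f"Then answer: {question}"
    )


# Usage: an 8x4 white image split into 4 bands, sent with a single query.
white = (255, 255, 255)
image = [[white] * 4 for _ in range(8)]
structured = add_horizontal_lines(image, num_regions=4)
prompt = scanning_prompt("How many red circles are there?", num_regions=4)
```

The augmented image and the prompt would be sent together in one query, which matches the single-query setting described above. Equal-height bands are the simplest choice here; whether a different number or placement of lines works better is not something this sketch addresses.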
