Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
Peijin Xie, Lin Sun, Bingquan Liu, Dexin Wang, Xiangzheng Zhang, Chengjie Sun, Jiajia Zhang
TL;DR
This work addresses two problems: the lack of a unified evaluation framework for visual spatial reasoning (VSR) in Vision Large Language Models (VLLMs), and the inconsistent way these models handle visual information and instructions. It introduces a unified instruction-based VSR test set and expands both the data and the model architecture through text augmentation, diffusion-based image generation, and a merged vision encoder (CLIP, SigLIP, DINOv2, SAM). The resulting VSRE (VSR Expert) achieves an accuracy gain of over 27% on the VSR benchmark and generalizes to related datasets (MME, MMBench, SEEDv2), showing improved sensitivity to visual positional information and reduced answer bias. The work releases VSRE and its data pipeline as open source to accelerate VSR learning in VLLMs and provides a scalable framework for future improvements in cross-modal reasoning.
Abstract
Distinguishing spatial relations is a basic part of human cognition that requires fine-grained perception across instances. Although benchmarks such as MME, MMBench, and SEED comprehensively evaluate various capabilities, including visual spatial reasoning (VSR), there is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning. To address this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction: they are over-sensitive to language instructions and under-sensitive to visual positional information. We mitigated this phenomenon by expanding the original benchmark in two respects, tuning data and model structure. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrated the original visual encoder (CLIP) with three other powerful visual encoders (SigLIP, SAM, and DINO). After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved an increase of over 27% in accuracy on the VSR test set and performs strongly on positional reasoning in both the VSR dataset and the relevant subsets of other evaluation benchmarks. We open-sourced the expanded model, together with the data and Appendix, at https://github.com/peijin360/vsre and hope it will accelerate advances in VSR learning for VLLMs.
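The diagnosis described above (sensitivity to instruction wording and bias in the answers) can be illustrated with a minimal sketch. It assumes a binary true/false formulation of each VSR item and a hypothetical record format with the keys `template`, `label`, and `pred`; the function name `diagnose` and the two summary statistics are illustrative assumptions, not the metrics or code used in the paper.

```python
from collections import defaultdict


def diagnose(records):
    """Summarize instruction sensitivity and answer bias.

    Each record is a dict (hypothetical format, assumed for this sketch):
      'template': name of the instruction template used for the prompt
      'label':    gold answer as a bool (True = relation holds)
      'pred':     model answer as a bool
    """
    per_template = defaultdict(lambda: [0, 0])  # template -> [correct, total]
    yes_pred = yes_gold = total = 0

    for r in records:
        stats = per_template[r["template"]]
        stats[0] += int(r["pred"] == r["label"])
        stats[1] += 1
        yes_pred += int(r["pred"])
        yes_gold += int(r["label"])
        total += 1

    if total == 0:
        raise ValueError("no records to diagnose")

    accuracy = {t: c / n for t, (c, n) in per_template.items()}
    return {
        # Accuracy measured separately for each instruction template.
        "accuracy_by_template": accuracy,
        # A large spread suggests over-sensitivity to instruction wording.
        "accuracy_spread": max(accuracy.values()) - min(accuracy.values()),
        # Positive values mean the model answers "true" more often than the gold labels do.
        "yes_bias": (yes_pred - yes_gold) / total,
    }


example = [
    {"template": "A", "label": True, "pred": True},
    {"template": "A", "label": False, "pred": True},
    {"template": "B", "label": True, "pred": True},
    {"template": "B", "label": False, "pred": False},
]
report = diagnose(example)
```

On the toy `example` above, template A scores lower than template B and the model over-predicts "true", which is the kind of pattern a unified instruction-based test set is meant to expose.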
