Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

Peijin Xie, Lin Sun, Bingquan Liu, Dexin Wang, Xiangzheng Zhang, Chengjie Sun, Jiajia Zhang

TL;DR

This work addresses the lack of a unified evaluation framework for visual spatial reasoning (VSR) in Vision-Language Large Models (VLLMs) and the inconsistent handling of visual information and instructions. It introduces a unified instruction-based VSR test set and expands both data and model architecture by leveraging text augmentation, diffusion-based image generation, and a merged vision encoder (CLIP, SigLIP, DINOv2, SAM). The proposed VSRE (VSR Expert) demonstrates a ~27% accuracy gain on the VSR benchmark and generalizes to related datasets (MME, MMBench, SEEDv2), highlighting improved sensitivity to visual positional information and reduced answer bias. The work offers an open-source VSRE and data pipeline to accelerate VSR learning in VLLMs and provides a scalable framework for future cross-modal reasoning improvements.

Abstract

Distinguishing spatial relations is a basic part of human cognition and requires fine-grained perception across instances. Although benchmarks such as MME, MMBench and SEED comprehensively evaluate various capabilities, including visual spatial reasoning (VSR), there is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning. To address this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark in two respects, tuning data and model structure, we mitigated this phenomenon. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrated the original visual encoder (CLIP) with three other powerful visual encoders (SigLIP, SAM and DINO). After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved an increase of over 27% in accuracy on the VSR test set. It is a performant VLLM for positional reasoning on both the VSR dataset and relevant subsets of other evaluation benchmarks. We open-sourced the expanded model with data and Appendix at https://github.com/peijin360/vsre and hope it will accelerate advancements in VLLMs on VSR learning.

Paper Structure

This paper contains 23 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The overall expansion method through the training process. On the text branch, questions and answers are rewritten by templates (green blocks). On the image branch, the image inputs are repainted by the diffusion model using three methods, image-to-image, text-to-image and image inpainting, shown in the three rows of the left orange block. The expansion of the vision encoder into a powerful merged one is shown in the middle orange block with the dashed box.
  • Figure 2: Examples of the 3 settings, image-to-image (first row), text-to-image (middle row) and inpainting (last row), through the repainting process, with the original image-text pair and mask inputs on the left.
  • Figure 3: Illustration of the Merged Vision Encoder, which concatenates multiple visual features, each aligned by a projector or an adapter.
  • Figure 4: Result of scaling the vision model. We plot the accuracy on Test-G on the left in dashed lines and on Test-S on the right in solid lines.
  • Figure 5: Distribution of 200 selected samples across 7 common spatial relations, with LLaVA-1.5 13B (acc 51.2%) on the left and VSRE (acc 79.5%) on the right.
  • ...and 1 more figures
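
The Figure 3 caption describes the merged vision encoder as concatenating visual features from several encoders after each is aligned by a projector or adapter. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes that every encoder yields the same number of tokens and that concatenation is along the channel dimension; neither is stated here. The helper names (`linear`, `project`, `merge_features`), the toy feature widths and the random weights are all hypothetical stand-ins for real encoder outputs and learned projectors.

```python
import random

def linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned projector/adapter: a random out_dim x in_dim matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(tokens, weight):
    """Apply the weight matrix to every token vector."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]

def merge_features(features, projectors):
    """Project each encoder's tokens to a shared width, then concatenate them per token.

    Assumption: concatenation is channel-wise and token counts already match.
    """
    projected = [project(features[name], projectors[name]) for name in features]
    num_tokens = len(projected[0])
    if any(len(p) != num_tokens for p in projected):
        raise ValueError("all encoders must yield the same number of tokens")
    return [sum((p[i] for p in projected), []) for i in range(num_tokens)]

# Toy setup: four encoders with made-up feature widths and random features.
rng = random.Random(0)
enc_dims = {"clip": 8, "siglip": 6, "dinov2": 4, "sam": 5}
num_tokens, shared_dim = 3, 4
features = {
    name: [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]
    for name, dim in enc_dims.items()
}
projectors = {name: linear(dim, shared_dim, rng) for name, dim in enc_dims.items()}

merged = merge_features(features, projectors)
print(len(merged), len(merged[0]))  # tokens, merged width
```

In this toy setup each merged token has width `shared_dim * len(enc_dims)`. A real system would replace the random matrices with trained projection layers and would need to resample token grids wherever the encoders disagree on resolution.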