REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Agneet Chatterjee; Yiran Luo; Tejas Gokhale; Yezhou Yang; Chitta Baral


TL;DR

This work tackles the persistent problem of spatial reasoning in vision-language models by introducing REVISION, a Blender-based 3D rendering pipeline that generates spatially faithful synthetic images from prompts using a large asset library and a deterministic coordinate generator. It demonstrates a training-free mechanism for improving T2I spatial fidelity: the pipeline produces a reference image $x^{(g)}$ and guides diffusion-based synthesis via $I = \phi(I \mid x^{(g)}, T)$, yielding substantial improvements on VISOR and T2I-CompBench. The authors also present RevQA, a robust spatial-reasoning benchmark for multimodal LLMs, which reveals significant gaps and vulnerability to adversarial and negation-based questions across five state-of-the-art models. Collectively, REVISION provides a cost-efficient, modular approach that enhances spatial understanding in generative models and offers a scalable framework for evaluating and advancing spatial reasoning in vision-language systems.

Abstract

Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, such vision-language models have been found to lack the ability to reason correctly over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images given a textual prompt. REVISION is an extendable framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware generative models.
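
The parse-then-place idea behind the rendering pipeline can be illustrated with a small Python sketch. It is not the authors' implementation: the relation table, the axis convention, the `spacing` parameter, and the function names `parse_prompt` and `place_objects` are all assumptions made for illustration, and the table lists only six example relationships rather than the 11 the framework supports. The sketch stops at coordinates and leaves out Blender entirely.

```python
import re

# Hypothetical relation table: phrase -> unit offset (x, y, z) of object A
# relative to object B. Assumed axes: +x right, +y away from camera, +z up.
RELATION_OFFSETS = {
    "to the left of": (-1.0, 0.0, 0.0),
    "to the right of": (1.0, 0.0, 0.0),
    "in front of": (0.0, -1.0, 0.0),
    "behind": (0.0, 1.0, 0.0),
    "above": (0.0, 0.0, 1.0),
    "below": (0.0, 0.0, -1.0),
}


def strip_article(noun_phrase):
    """Drop a leading 'a', 'an', or 'the' from a noun phrase."""
    return re.sub(r"^(a|an|the)\s+", "", noun_phrase.strip())


def parse_prompt(prompt):
    """Split a prompt such as 'a cat to the left of a dog' into
    (object_a, relation, object_b)."""
    text = prompt.lower().strip().rstrip(".")
    for phrase in RELATION_OFFSETS:
        pattern = r"\s+" + re.escape(phrase) + r"\s+"
        parts = re.split(pattern, text, maxsplit=1)
        if len(parts) == 2:
            return strip_article(parts[0]), phrase, strip_article(parts[1])
    raise ValueError("no supported spatial relation in prompt: " + prompt)


def place_objects(prompt, spacing=2.0):
    """Deterministically map a prompt to two (name, (x, y, z)) placements,
    centred on the origin and separated by `spacing` along the relation axis."""
    a, relation, b = parse_prompt(prompt)
    half = spacing / 2.0
    offset = RELATION_OFFSETS[relation]
    pos_a = tuple(d * half for d in offset)
    pos_b = tuple(0.0 - d * half for d in offset)
    return [(a, pos_a), (b, pos_b)]


layout = place_objects("a cat to the left of a dog")
for name, position in layout:
    print(name, position)
```

Because the mapping is a fixed lookup with no sampling, the same prompt always yields the same placement, which is the property a deterministic coordinate generator needs. A real pipeline would additionally vary camera perspective and background, as the abstract describes.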


Paper Structure (29 sections, 20 figures, 10 tables)

This paper contains 29 sections, 20 figures, 10 tables.

Introduction
Related Work
Generative Models for Image Synthesis.
Controllable Image Generation for Spatial Fidelity.
Synthetic Images for Vision and Language.
Evaluation of Multimodal LLMs.
The REVISION Framework
Improving Spatial Fidelity in T2I Generation
Training-Free Image Generation with REVISION
Experimental Setup
Results and Analysis
Ablation Studies
Controllability vs Photo-Realism
Extending VISOR for Depth Relationships
Human Evaluations
...and 14 more sections

Figures (20)

Figure 1: Text-to-Image models struggle to generate images that faithfully represent the spatial relationships mentioned in the input prompt. We develop REVISION, an efficient rendering pipeline that enables a training-free and guidance-based mechanism to address this shortcoming. Our method improves spatial reasoning in T2I models for three-dimensional relationships, as demonstrated by consistently higher scores on the VISOR and T2I-CompBench benchmarks.
Figure 1: Average success rate of each MS-COCO object in REVISION, i.e., how often the object is placed in the generated image in a way that is spatially correct with respect to the input prompt. We report results using the white background with SD v1.5.
Figure 2: REVISION parses a prompt into assets (objects) and the spatial relationship between them and synthesizes a symbolic image in Blender, placing the respective object assets at coordinates corresponding to the parsed spatial relationship.
Figure 2: Illustrative example of leveraging REVISION to generate spatially correct images with 3 objects and 2 relationships.
Figure 3: Outputs from the REVISION rendering pipeline for 4 spatial relationship types with identical assets, with (bottom) and without (top) a floor.
...and 15 more figures
