REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Agneet Chatterjee, Yiran Luo, Tejas Gokhale, Yezhou Yang, Chitta Baral

TL;DR

This work tackles the persistent problem of spatial reasoning in vision-language models by introducing REVISION, a Blender-based 3D rendering pipeline that can generate spatially faithful synthetic images from prompts using a large asset library and a deterministic coordinate generator. It demonstrates a training-free mechanism to improve T2I spatial fidelity by producing a reference image $x^{(g)}$ and guiding diffusion-based synthesis via $I = \phi(I \mid x^{(g)}, T)$, enabling substantial improvements on VISOR and T2I-CompBench. The authors also present RevQA, a robust spatial-reasoning benchmark for multimodal LLMs, revealing significant gaps and vulnerability to adversarial and negation-based questions across five state-of-the-art models. Collectively, REVISION provides a cost-efficient, modular approach that enhances spatial understanding in generative models and offers a scalable framework for evaluating and pushing forward spatial reasoning in vision-language systems.

Abstract

Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extensible framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware generative models.

Paper Structure (29 sections, 20 figures, 10 tables)

Figures (20)

  • Figure 1: Text-to-Image models struggle to generate images that faithfully represent the spatial relationships mentioned in the input prompt. We develop REVISION, an efficient rendering pipeline that enables a training-free, guidance-based mechanism to address this shortcoming. Our method improves the spatial reasoning of T2I models on three-dimensional relationships, as demonstrated by consistently higher scores on the VISOR and T2I-CompBench benchmarks.
  • Figure 1: Average success rate of each MS-COCO object in REVISION, i.e., how often the object is spatially correct in the generated image according to the input prompt. We report results using the white background with SD v1.5.
  • Figure 2: REVISION parses a prompt into assets (objects) and the spatial relationship between them and synthesizes a symbolic image in Blender, placing the respective object assets at coordinates corresponding to the parsed spatial relationship.
  • Figure 2: Illustrative example of leveraging REVISION to generate spatially correct images with 3 objects and 2 relationships.
  • Figure 3: Outputs from the REVISION rendering pipeline for 4 spatial relationship types with identical assets, without a floor (top) and with a floor (bottom).
  • ...and 15 more figures
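
The Figure 2 caption above describes parsing a prompt into two assets and a spatial relationship, then placing the assets at coordinates that correspond to that relationship. The Python sketch below illustrates the idea only and is not the authors' implementation. The relationship phrases, the axis convention (Z up, negative Y toward the viewer), the `spacing` parameter, and the names `RELATION_OFFSETS`, `parse_prompt`, and `place` are all assumptions made for illustration. The six relationships listed are examples and not the 11 supported by REVISION, and no Blender API is called.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical unit offsets (x, y, z) of object A relative to object B.
# Assumed axis convention: +X is right, +Z is up, -Y is toward the viewer.
RELATION_OFFSETS = {
    "to the left of": (-1.0, 0.0, 0.0),
    "to the right of": (1.0, 0.0, 0.0),
    "above": (0.0, 0.0, 1.0),
    "below": (0.0, 0.0, -1.0),
    "in front of": (0.0, -1.0, 0.0),
    "behind": (0.0, 1.0, 0.0),
}

_ARTICLE = r"(?:(?:an?|the)\s+)?"  # optional leading article


@dataclass(frozen=True)
class Placement:
    asset: str
    position: Tuple[float, float, float]


def parse_prompt(prompt: str) -> Tuple[str, str, str]:
    """Split 'a <A> <relation> a <B>' into (A, relation, B)."""
    text = prompt.lower().strip().rstrip(".")
    # Try longer phrases first so multi-word relations are matched whole.
    for relation in sorted(RELATION_OFFSETS, key=len, reverse=True):
        pattern = (
            rf"^{_ARTICLE}(.+?)\s+{re.escape(relation)}\s+{_ARTICLE}(.+)$"
        )
        match = re.match(pattern, text)
        if match:
            return match.group(1), relation, match.group(2)
    raise ValueError(f"no supported spatial relationship in: {prompt!r}")


def place(prompt: str, spacing: float = 2.0) -> List[Placement]:
    """Deterministically assign coordinates: B at the origin, A offset from it."""
    asset_a, relation, asset_b = parse_prompt(prompt)
    dx, dy, dz = RELATION_OFFSETS[relation]
    return [
        Placement(asset_a, (dx * spacing, dy * spacing, dz * spacing)),
        Placement(asset_b, (0.0, 0.0, 0.0)),
    ]


if __name__ == "__main__":
    for item in place("a cat to the left of a dog"):
        print(item)
```

In a full pipeline, the returned positions would be handed to a renderer that loads the matching 3D assets. Because the mapping from relationship to coordinates is a fixed lookup, the same prompt always yields the same layout.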