REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale, Yezhou Yang, Chitta Baral
TL;DR
This work tackles the persistent problem of spatial reasoning in vision-language models by introducing REVISION, a Blender-based 3D rendering pipeline that generates spatially faithful synthetic images from text prompts using a large asset library and a deterministic coordinate generator. It demonstrates a training-free mechanism for improving text-to-image (T2I) spatial fidelity: REVISION produces a reference image $x^{(g)}$, which guides diffusion-based synthesis via $I = \phi(I \mid x^{(g)}, T)$, yielding substantial improvements on VISOR and T2I-CompBench. The authors also present RevQA, a robust spatial-reasoning benchmark for multimodal LLMs, which reveals significant performance gaps and vulnerability to adversarial and negation-based questions across five state-of-the-art models. Collectively, REVISION provides a cost-efficient, modular approach that enhances spatial understanding in generative models and offers a scalable framework for evaluating and advancing spatial reasoning in vision-language systems.
Abstract
Text-to-Image (T2I) models and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, such vision-language models have been found to lack the ability to reason correctly over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images given a textual prompt. It is an extensible framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Using images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR and T2I-CompBench benchmarks. We also design RevQA, a question-answering benchmark to evaluate the spatial reasoning abilities of MLLMs, and find that state-of-the-art models are not robust to complex spatial reasoning under adversarial settings. Our results and findings indicate that rendering-based frameworks are an effective approach for developing spatially aware generative models.
