Evaluating Compositional Scene Understanding in Multimodal Generative Models
Shuhao Fu, Andrew Jun Lee, Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor W. Webb
TL;DR
This work systematically evaluates the compositional scene understanding capabilities of the current generation of multimodal generative models. It combines a detailed assessment of DALL-E 3 on basic, reversed, and compositional relational prompts with human judgments, and a cross-model evaluation of several multimodal LLMs on real-world Bongard-HOI and synthetic SVRT tasks, benchmarked against human performance. The findings show measurable progress over prior models but persistent gaps relative to humans, especially as scene complexity increases and multi-relational reasoning is required. The study highlights the binding problem as a core bottleneck and suggests future work integrating object-centric representations to bolster robust, generalizable compositional reasoning in vision-language systems.
Abstract
The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.
