Evaluating Compositional Scene Understanding in Multimodal Generative Models

Shuhao Fu, Andrew Jun Lee, Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor W. Webb

TL;DR

This work systematically evaluates the compositional scene understanding capabilities of the current generation of multimodal generative models. It combines a detailed assessment of DALL-E 3 on basic, reversed, and compositional relational prompts with human judgments, and a cross-model evaluation of several multimodal LLMs on real-world Bongard-HOI and synthetic SVRT tasks, benchmarked against human performance. The findings show measurable progress over prior models but persistent gaps relative to humans, especially as scene complexity increases and multi-relational reasoning is required. The study highlights the binding problem as a core bottleneck and suggests future work integrating object-centric representations to bolster robust, generalizable compositional reasoning in vision-language systems.

Abstract

The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.

Paper Structure

This paper contains 35 sections, 21 figures.

Figures (21)

  • Figure 1: Examples of images generated by DALL-E 3 for basic relational prompts (prompts that describe scenes involving relations), using the 'natural' style. See Figure \ref{basic_relation_additional_examples} for more examples.
  • Figure 2: Results for basic relational prompts. The y-axis indicates the agreement between prompts and the images generated by either DALL-E 3 or DALL-E 2, as judged by human participants. Results for DALL-E 3 (in blue) are from the present study. Results for DALL-E 2 (in red) are from Conwell & Ullman (2022). Each point reflects the average agreement for an individual image. Horizontal lines indicate the mean agreement for each relation, and boxes indicate 95% confidence intervals.
  • Figure 3: Examples of images generated by DALL-E 3 for basic relational prompts and their corresponding reversed prompts, using the 'vivid' style. See Figure \ref{reversed_relation_additional_examples} for more examples.
  • Figure 4: Results for compositional relational prompts. The y-axis indicates the agreement between prompts and the images generated by DALL-E 3, as judged by human participants. Each point reflects the average agreement for an individual image. Horizontal lines indicate the mean agreement for each relation, and boxes indicate 95% confidence intervals.
  • Figure 5: Examples of images generated by DALL-E 3 for compositional relational prompts, using the 'vivid' style. See Figure \ref{compositional_relation_additional_examples} for more examples.
  • ...and 16 more figures