Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability

Ning Li, Jingran Zhang, Justin Cui

TL;DR

This work examines whether GPT-4o can integrate world knowledge into image generation by evaluating global instruction adherence, fine-grained editing, and post-generation reasoning. Using three prompt families (Global Instruction, Image Editing, and Post-Generation Reasoning), the study finds persistent gaps: literal interpretation of prompts, inconsistent application of knowledge constraints, and difficulty maintaining context across sequential tasks. These findings challenge the assumption that image understanding and generation are truly unified in multimodal LLMs, and they point to the need for benchmarks and training strategies that emphasize context-aware, reasoning-grounded generation.

Abstract

OpenAI's multimodal GPT-4o has demonstrated remarkable capabilities in image generation and editing, yet its ability to achieve world knowledge-informed semantic synthesis--seamlessly integrating domain knowledge, contextual reasoning, and instruction adherence--remains unproven. In this study, we systematically evaluate these capabilities across three critical dimensions: (1) Global Instruction Adherence, (2) Fine-Grained Editing Precision, and (3) Post-Generation Reasoning. While existing benchmarks highlight GPT-4o's strong capabilities in image generation and editing, our evaluation reveals GPT-4o's persistent limitations: the model frequently defaults to literal interpretations of instructions, inconsistently applies knowledge constraints, and struggles with conditional reasoning tasks. These findings challenge prevailing assumptions about GPT-4o's unified understanding and generation capabilities, exposing significant gaps in its dynamic knowledge integration. Our study calls for the development of more robust benchmarks and training strategies that go beyond surface-level alignment, emphasizing context-aware and reasoning-grounded multimodal generation.

Paper Structure

This paper contains 10 sections, 4 figures.

Figures (4)

  • Figure 1: Demonstration of a global instruction prompt example.
  • Figure 2: Examples of generated images with global instructions.
  • Figure 3: Examples of image editing performed by GPT-4o.
  • Figure 4: Examples of post-generation reasoning performed by GPT-4o.