MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

TL;DR

MICON-Bench is a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. It is paired with an MLLM-driven Evaluation-by-Checkpoint framework that automatically verifies semantic and visual consistency, with a multimodal large language model (MLLM) serving as the verifier.

Abstract

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, in which a multimodal large language model (MLLM) serves as the verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. GitHub: https://github.com/Angusliuuu/MICON-Bench.
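
The Evaluation-by-Checkpoint idea in the abstract can be illustrated with a minimal sketch. It assumes each checkpoint is a yes/no question put to a verifier and that the composite score is a weighted pass rate. The `Checkpoint` class, the verifier signature, the weights, and the aggregation rule are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    question: str        # yes/no question posed to the verifier
    weight: float = 1.0  # hypothetical weight, not from the paper


# Hypothetical verifier signature: (image, question) -> True if satisfied.
Verifier = Callable[[object, str], bool]


def checkpoint_score(image, checkpoints: List[Checkpoint], verifier: Verifier) -> float:
    """Weighted fraction of checkpoints that the verifier judges as satisfied."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoints must have positive total weight")
    passed = sum(c.weight for c in checkpoints if verifier(image, c.question))
    return passed / total


def stub_verifier(image, question: str) -> bool:
    # Stand-in for an MLLM call: here the "image" is just a set of
    # facts that are known to hold for it.
    return question in image
```

In practice the stub would be replaced by a call to an MLLM that inspects the generated image. Keeping each checkpoint as a separate question is what makes the composite score interpretable, since every failed check can be reported by name.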

Paper Structure (47 sections, 5 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Overview of MICON-Bench and Evaluation Pipeline. MICON-Bench is a comprehensive benchmark designed to evaluate multi-image context generation across six diverse tasks: Object Composition, Spatial Composition, Attribute Disentanglement, Component Transfer, FG/BG Composition, and Story Generation. Each task provides multiple reference images and a compositional prompt requiring the model to integrate, reason, or transfer information across images. To assess model performance, we introduce an Evaluation-by-Checkpoint framework driven by an MLLM, which automatically evaluates generated results along key dimensions, producing an interpretable composite score.
  • Figure 2: Overview of the proposed Dynamic Attention Rebalancing (DAR) mechanism. Given multiple reference images, DAR first samples query tokens and computes attention maps between the sampled queries and the reference key tokens. It then applies a dynamic weighting factor to rebalance attention responses, reinforcing relevant reference regions (green boxes) while suppressing distractions (red boxes). The example at the bottom shows how DAR helps compose a new scene by integrating the Indian man on the right of Ref Image 1 and the doctor on the left of Ref Image 2, achieving coherent and faithful visual synthesis.
  • Figure 3: Visualization examples of our method vs. baseline on MICON-Bench.
  • Figure 4: DAR suppresses noisy, irrelevant attention (red boxes) and re-focuses activations on the correct subjects (green boxes), resulting in cleaner referencing and more faithful composition. The blue boxes highlight the target subjects.
  • Figure 5: The data statistics of MICON-Bench.
  • ...and 4 more figures
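
The three steps named in the Figure 2 caption (sample query tokens, compute attention over reference keys, apply a dynamic weighting factor) can be sketched in plain Python. This is only an illustration under stated assumptions: the weighting rule below (average attention mass relative to a uniform baseline, raised to a power `gamma`) is hypothetical, as are the function names and parameters. The paper's actual formulation is not reproduced here.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rebalance_weights(queries, ref_keys, n_samples=4, gamma=1.0, seed=0):
    """Per-key weights from the attention of a random query sample (hypothetical rule)."""
    rng = random.Random(seed)
    sampled = rng.sample(queries, min(n_samples, len(queries)))
    scale = 1.0 / math.sqrt(len(ref_keys[0]))
    # Average attention each reference key receives from the sampled queries.
    mass = [0.0] * len(ref_keys)
    for q in sampled:
        probs = softmax([scale * dot(q, k) for k in ref_keys])
        for j, p in enumerate(probs):
            mass[j] += p / len(sampled)
    uniform = 1.0 / len(ref_keys)
    # Weight > 1 reinforces keys with above-uniform attention; < 1 suppresses the rest.
    return [(m / uniform) ** gamma for m in mass]


def rebalanced_attention(query, ref_keys, weights):
    """Softmax attention with log-weights added to the logits."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * dot(query, k) + math.log(max(w, 1e-12))
              for k, w in zip(ref_keys, weights)]
    return softmax(logits)
```

Adding `log(w)` to a logit multiplies the unnormalized attention for that key by `w`, so the result is still a valid distribution and no retraining is involved. That matches the training-free, inference-time character the abstract describes, even though the specific rule here is assumed.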