MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

TL;DR

MultiBanana introduces a rigorous, scalable benchmark for multi-reference text-to-image generation that expands reference count up to 8 and incorporates cross-domain, scale, rare-concept, and multilingual challenges. It couples a four-stage data construction pipeline (real+synthetic data, filtering, hierarchical categorization, and instruction generation) with 48 task variants to probe compositional reasoning and identity preservation. The authors evaluate multiple open and closed models with AI-based judges (Gemini-2.5, GPT-5) across five fine-grained criteria, revealing two principal failure modes: strict reference adherence often hurts global coherence as references accumulate, while relaxing adherence can preserve visual quality but miss intended edits. They further propose agentic inference strategies (IPR, CAFG, SRA) to improve performance, provide extensive supplementary statistics, and release the benchmark openly to standardize comparisons in multi-reference generation.

Abstract

Recent text-to-image generation models have acquired the ability to perform multi-reference generation and editing: inheriting the appearance of subjects from multiple reference images and re-rendering them under new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or from pinpointing model weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals where they perform well, their typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana .

Paper Structure

This paper contains 38 sections, 3 equations, 27 figures, 10 tables.

Figures (27)

  • Figure 1: The overview of MultiBanana. MultiBanana widely covers multi-reference specific problems, varying the number of references (the top row), domain and scale mismatch among references (two on the left in the middle row), multilingual text rendering (center in the bottom row), and containing rare concepts (right in the bottom row).
  • Figure 2: Construction pipeline for our benchmark, consisting of four stages: (1) collecting high-quality real and synthetic images, (2) filtering out inappropriate or low-quality samples, (3) performing hierarchical category classification, and (4) generating and validating editing instructions by Gemini and humans.
  • Figure 3: (Left) Comparison between the statistics of real data only and those after adding synthetic data. The original dataset was biased toward background images, with few person- and object-related samples. To correct this imbalance, we generated additional synthetic images using Nanobanana and ChatGPT-Image-1, focusing on clear subjects such as people, animals, and objects. This significantly increased person- and object-related categories, resulting in a more balanced and comprehensive benchmark. (Right) Examples of synthesized images in each category.
  • Figure 4: (Left) Breakdown of the multi-reference tasks. The editing sets were selected to ensure that the number of sets within each task is balanced across different reference counts. (Middle) For every X-reference task, the dataset contains at least 390 editing sets. Further, each colored task category also includes at least 70 sets, which is larger than in the prior work xia2025dreamomni2. (Right) Word cloud generated from all prompts. It primarily consists of terms that describe a wide range of object categories as well as words indicating spatial directions.
  • Figure 5: Changes in scores for each evaluation criterion when varying the number of reference images. For both open-source and closed-source models, all scores generally decrease as the number of references increases.
  • ...and 22 more figures
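
The per-criterion trend described for Figure 5 can be reproduced from per-sample judge scores with a simple grouping step. The following is a minimal sketch, not the authors' evaluation code: the record layout (`num_refs`, `scores`), the criterion names, and the toy values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_reference_count(records):
    """Average judge scores per (criterion, number of references).

    `records` is a hypothetical layout: each item holds the number of
    reference images and a dict mapping criterion name to a score.
    """
    buckets = defaultdict(list)
    for rec in records:
        for criterion, score in rec["scores"].items():
            buckets[(criterion, rec["num_refs"])].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Toy records with made-up values, for illustration only (not benchmark results).
toy_records = [
    {"num_refs": 1, "scores": {"criterion_a": 8, "criterion_b": 9}},
    {"num_refs": 1, "scores": {"criterion_a": 6, "criterion_b": 7}},
    {"num_refs": 8, "scores": {"criterion_a": 4, "criterion_b": 5}},
]

table = mean_score_by_reference_count(toy_records)
for (criterion, num_refs), avg in sorted(table.items()):
    print(f"{criterion} refs={num_refs} mean={avg}")
```

Keying the result on the pair (criterion, reference count) keeps each criterion's curve separate, so a decline in one criterion is not masked by averaging across the others.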