MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu

Abstract

Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

Paper Structure

This paper contains 49 sections, 2 equations, 26 figures, and 10 tables.

Figures (26)

  • Figure 1: Overview of MacroData. MacroData contains 400K high-quality samples with up to 10 input images across four long-context multi-reference image generation tasks: (a) Customization: generating compositions conditioned on multiple reference images, (b) Illustration: producing illustrative images based on multimodal context, (c) Spatial: predicting novel view images given multiple views, specifically including outside-in objects and inside-out scenes, and (d) Temporal: forecasting future frames based on a historical sequence. Each task comprises 100K samples, split by the number of reference images into buckets of 1-3, 4-5, 6-7, and 8-10 inputs (a schematic sketch of this layout appears after the figure list).
  • Figure 2: Statistics of MacroData. MacroData comprises four tasks, each containing 100K samples. (a) The number of input images in the customization subtask averages 5.84 per sample, with a maximum of 10. (b) Across all tasks, samples average 5.44 input images. (c) Comparison among different datasets. (d) The distribution of data composition for each task.
  • Figure 3: Customization Subset Pipeline composites preprocessed metadata via rule-based and VLM-reasoned sampling and applies a bidirectional assessment to ensure reference fidelity and prompt consistency.
  • Figure 4: Illustration Subset Pipeline identifies highly relevant anchor images from interleaved data as generation targets and utilizes VLMs to rewrite and filter the preceding context for narrative coherence.
  • Figure 5: Spatial Subset Pipeline samples input and target views from canonical directions for outside-in objects and inside-out panoramas, applying spatial overlap filters to ensure plausibility.
  • ...and 21 more figures
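
To make the data organization described in Figures 1 and 2 concrete, the sketch below shows one possible way to represent a MacroData-style sample and map it to the reported reference-count buckets (1-3, 4-5, 6-7, 8-10). The `MacroSample` class, its field names, and the file paths are illustrative assumptions for exposition, not the released dataset format.

```python
from dataclasses import dataclass
from typing import List

# Task names and reference-count buckets taken from the paper's description
# (four tasks x 100K samples, 1-10 reference images per sample).
# The actual released schema may differ.
TASKS = ("customization", "illustration", "spatial", "temporal")
REF_BUCKETS = ((1, 3), (4, 5), (6, 7), (8, 10))


@dataclass
class MacroSample:
    """Hypothetical layout of one multi-reference training sample."""
    task: str                    # one of TASKS
    reference_images: List[str]  # paths to 1-10 conditioning images
    prompt: str                  # textual instruction or narrative context
    target_image: str            # path to the image the model should generate

    def ref_bucket(self) -> str:
        """Map the number of references to the paper's reporting buckets."""
        n = len(self.reference_images)
        for lo, hi in REF_BUCKETS:
            if lo <= n <= hi:
                return f"{lo}-{hi}"
        raise ValueError(f"unexpected reference count: {n}")


# Example: a customization sample with six references falls in the 6-7 bucket.
sample = MacroSample(
    task="customization",
    reference_images=[f"refs/subject_{i}.png" for i in range(6)],
    prompt="Compose the six subjects into a single park scene.",
    target_image="targets/park_scene.png",
)
assert sample.ref_bucket() == "6-7"
```

Under this reading, reporting results per bucket (as in Figure 1) amounts to grouping evaluation samples by `ref_bucket()` and scoring each group separately, which is how performance degradation with growing reference counts can be made visible.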