MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang

TL;DR

This work addresses the lack of high-quality data for multi-image composition (MICo) by introducing MICo-150K, a large, diverse dataset covering seven MICo tasks and a De&Re track, built with a Compose-by-Retrieval strategy and human-in-the-loop refinement. It establishes MICo-Bench for standardized evaluation and introduces Weighted-Ref-VIEScore, a human-aligned metric that accounts for multiple source images and a reference image while guarding against copy-paste hacks. The authors report consistent improvements across open-source models after fine-tuning on MICo-150K, with Qwen-MICo achieving near state-of-the-art performance while supporting arbitrary input counts. An appendix covers reproducibility, ethics, and additional analyses.

Abstract

In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly because high-quality training data is lacking. To bridge this gap, we conduct a systematic study of MICo: we categorize it into 7 representative tasks, curate a large-scale collection of high-quality source images, and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a large, balanced set of composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, in which 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models that lack MICo capability and further enhances those that already have it. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs, beyond the latter's input limit. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.

Paper Structure

This paper contains 28 sections, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Previous MICo methods typically collect high-quality images or video frames as target images (1). Using Open-Vocabulary Detectors (OVD) [liu2024groundingdinomarryingdino] and SAM [kirillov2023SAM], objects within the targets are segmented to obtain source images (2). Some methods enhance the targets by retrieving additional frames of the same subject from videos (3), or enhance the sources using S2I (Subject-to-Image) or inpainting models (4). Training pairs are then constructed along multiple paths: (2→1), (2→3), (4→1), and (4→3). However, the masks in (2) are often incomplete and semantically ambiguous; the generated images in (4) tend to share similar styles and content, with limited diversity, because they rely on a few fixed generative models; and the frames in (3) originate from a small number of high-quality videos, leading to limited scene variety and a lack of imaginative or complex multi-subject scenarios.
  • Figure 2: Construction pipeline of MICo-150K. (a) The data construction pipeline for the Human-Centric, Object-Centric, and HOI (Human–Object Interaction) tasks. (b) The pipeline for the De&Re (Decompose and Recompose) task.
  • Figure 3: Visualization examples from the MICo-150K dataset. Row 1 (Object-Centric): “2 objects + scene” and “4 objects” compositions. Row 2 (Person-Centric): “3 women” and “2 persons + scene”. Row 3 (Human-Object Interaction): “1 person + 4 objects” and “2 persons + 2 objects”. Row 4 (De&Re): the first image is a real-world photo, the last is the recomposed result, with intermediate visual elements including decomposed persons, objects, clothes, and scene components.
  • Figure 4: High-quality multi-image composition datasets that are neither segmentation-based nor generated by Flux-series models [labs2025flux1kontext] are extremely rare; to the best of our knowledge, only Echo-4o [ye2025echo4o] is publicly available. MICo-150K significantly surpasses it in both source image diversity and text prompt semantic variety.
  • Figure 5: Traditional VIEScore requires feeding all source images and the generated image into the evaluator, which often degrades performance because GPT-4o's cross-image attention becomes overloaded. This prevents the model from fully understanding each image and from accurately determining whether every source appears in the target, resulting in substantial scoring errors (in this example, all three human evaluators agreed that Image B was far superior). In contrast, MICo-Bench first assesses whether each source image appears in the generated result to produce weights. Each case also includes a verified reference image that contains all sources. During evaluation, GPT-4o compares only the generated image and the reference image, enabling human-level judgment accuracy.
  • ...and 15 more figures
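
The two-stage evaluation described in the Figure 5 caption (per-source presence checks that produce weights, followed by a single comparison against a verified reference image) can be illustrated with a small sketch. The exact weighting used by Weighted-Ref-VIEScore is not stated in this summary, so the formula below is an assumption: the weight is simply the fraction of source images judged present. The function name `weighted_ref_score` and its parameters are hypothetical, and in practice both inputs would come from an evaluator model rather than being supplied by hand.

```python
from typing import Sequence


def weighted_ref_score(source_present: Sequence[bool], ref_score: float) -> float:
    """Scale a reference-comparison score by the fraction of sources found.

    ASSUMPTION: the weight is the plain fraction of source images judged
    present in the generated image; the paper's actual weighting may differ.
    """
    if not source_present:
        raise ValueError("at least one source image is required")
    # Stage 1: per-source presence checks are aggregated into one weight.
    weight = sum(source_present) / len(source_present)
    # Stage 2: the generated-vs-reference score is scaled by that weight.
    return weight * ref_score


# Hypothetical case: 3 of 4 sources appear in the generated image, and the
# evaluator rates the generated image 8.0 against the reference image.
example = weighted_ref_score([True, True, False, True], 8.0)
```

Under this assumed weighting, an output that omits a source is penalized even if it closely resembles the reference in other respects.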