MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Xinyu Wei; Kangrui Cen; Hongyang Wei; Zhen Guo; Bairui Li; Zeqing Wang; Jinrui Zhang; Lei Zhang

TL;DR

This work addresses the lack of high-quality data for multi-image composition (MICo) by introducing MICo-150K, a large, diverse dataset covering seven MICo tasks and a Decomposition-and-Recomposition (De&Re) track, built with a Compose-by-Retrieval strategy and human-in-the-loop refinement. It establishes MICo-Bench for standardized evaluation and introduces Weighted-Ref-VIEScore, a human-aligned metric that accounts for multiple source and reference images while guarding against copy-paste shortcuts. The authors report consistent improvements across open-source models after fine-tuning on MICo-150K, with Qwen-MICo matching Qwen-Image-2509 on 3-image composition while supporting an arbitrary number of input images. An appendix covers reproducibility, ethics, and additional analyses.

Abstract

In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly because high-quality training data is lacking. To bridge this gap, we conduct a systematic study of MICo: we categorize it into 7 representative tasks, curate a large-scale collection of high-quality source images, and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a large, balanced set of composite images and apply human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, in which 11K complex real-world images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench, with 100 cases per task and 300 challenging De&Re cases, and introduce a new metric, Weighted-Ref-VIEScore, tailored specifically to MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K gives MICo capability to models that lack it and further improves models that already have it. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting an arbitrary number of input images, beyond the latter's input limit. Our dataset, benchmark, and baseline together offer valuable resources for further research on Multi-Image Composition.
