
IC-Custom: Diverse Image Customization via In-Context Learning

Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan

TL;DR

IC-Custom addresses the challenge of unified image customization by bridging position-aware and position-free tasks under a single in-context framework. It introduces In-Context Multi-Modal Attention (ICMA) with learnable task tokens and boundary-aware embeddings, and an in-context diptych input representation built on the DiT architecture. To train effectively, it constructs CustomData (12K identity-consistent diptychs) and evaluates on ProductBench and DreamBench, achieving substantial gains in identity consistency, harmony, and text alignment with only ~0.4% of parameters updated. The results indicate strong practical potential for industrial media workflows, enabling flexible, identity-preserving edits across diverse scenarios.

Abstract

Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free paradigms and lack a universal framework for diverse customization, which limits their applicability across scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-Context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to handle diverse tasks and to distinguish between inputs in polyptych configurations. To address the data gap, we curated a 12K identity-consistent dataset with 8K real-world and 4K high-quality synthetic samples, avoiding the overly glossy, oversaturated look typical of synthetic data. IC-Custom supports various industrial applications, including try-on, image insertion, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves about 73% higher human preference across identity consistency, harmony, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom
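
To make the in-context input concrete, the following is a minimal sketch of how a two-panel (diptych) input and its mask could be assembled for the two customization modes. It is not the authors' implementation: it works on pixel arrays rather than latents, the side-by-side layout is an assumption, and the function name `make_diptych` and its arguments are hypothetical. It follows the Figure 2 caption in treating a fully masked fill-in panel as position-free and a partially masked one as position-aware.

```python
import numpy as np

def make_diptych(reference, target, mask=None):
    """Build a masked two-panel input (hypothetical helper, illustration only).

    reference, target: float arrays of shape (H, W, C) with identical shapes.
    mask: optional (H, W) array, 1 where the target panel should be regenerated.
          None is treated as position-free: the whole target panel is masked.
    Returns (masked_diptych, diptych_mask) with widths of 2 * W.
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must share a shape")
    h, w, _ = target.shape
    if mask is None:
        # Position-free: nothing in the fill-in panel is kept.
        mask = np.ones((h, w), dtype=np.float32)
    mask = mask.astype(np.float32)

    # The reference panel is never masked; only the fill-in panel is.
    diptych_mask = np.concatenate([np.zeros((h, w), np.float32), mask], axis=1)
    diptych = np.concatenate([reference, target], axis=1)  # assumed layout

    # Zero out the masked pixels so the model has to fill them in.
    masked_diptych = diptych * (1.0 - diptych_mask)[..., None]
    return masked_diptych, diptych_mask
```

Passing `mask=None` gives the position-free case, while passing a partial mask (precise or user-drawn) gives the position-aware case. In both cases the reference panel stays fully visible.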

Paper Structure

This paper contains 44 sections, 6 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Visualization of IC-Custom results. Our method supports diverse image customization scenarios, including position-aware (location-specified editing conditioned on a mask) and position-free (ID-consistent generation guided by text) customization.
  • Figure 2: Model overview. (1) Our model takes in-context diptych inputs together with redux embeddings and text prompts. (2) During training, it randomly chooses to mask either the entire fill-in image (position-free customization) or only partial regions (position-aware customization) to produce diverse in-context latents. (3) The ICMA module, equipped with task-oriented register tokens and boundary-aware positional embeddings (see the section on In-Context Multi-Modal Attention), is integrated into the architecture. We train LoRA adapters on the ICMA module while unfreezing the input layers.
  • Figure 3: (a) In-Context Multi-Modal Attention (ICMA). ICMA incorporates learnable task-oriented register tokens and boundary-aware positional embeddings (RE, FE) into the multi-modal attention of MM-DiT to specify customization types and delineate input boundaries. (b) Training data examples. High-quality identity-consistent quadruples $\{C_{\mathrm{I}}, C_{\mathrm{I}'}, M, C_{\mathrm{T}}\}$ from real-world and synthetic data; for clarity, text descriptions $C_{\mathrm{T}}$ are omitted.
  • Figure 4: Qualitative comparison of position-aware customization under precise-mask and user-drawn-mask settings. OminiCtrl and DreamO lack support for fill-in inputs. IC-Custom achieves high-quality customization with harmonious lighting, shadows, and perspectives.
  • Figure 5: Qualitative comparison on position-free customization. IC-Custom achieves more realistic, coherent, and detailed customization. Red circles highlight incorrect regions or details.
  • ...and 10 more figures
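
The token layout described in the Figure 3 caption can be illustrated with a small NumPy sketch. It rests on several assumptions that the text above does not confirm: the boundary-aware embeddings (RE, FE) are simply added to the reference and fill-in tokens, the register tokens are prepended to the sequence, and attention is single-head with identity projections. All names (`task_registers`, `build_sequence`, `joint_attention`) and sizes are hypothetical, and the parameters are random stand-ins for values a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical token width

# Hypothetical stand-ins for learned parameters.
task_registers = {
    "position_aware": rng.normal(size=(2, D)),
    "position_free": rng.normal(size=(2, D)),
}
ref_embed = rng.normal(size=(D,))   # plays the role of "RE"
fill_embed = rng.normal(size=(D,))  # plays the role of "FE"

def build_sequence(text_tokens, ref_tokens, fill_tokens, task):
    """Assemble one joint token sequence for the chosen customization task."""
    registers = task_registers[task]   # signals the customization type
    ref = ref_tokens + ref_embed       # marks tokens of the reference panel
    fill = fill_tokens + fill_embed    # marks tokens of the fill-in panel
    # The ordering below is an assumption made for illustration.
    return np.concatenate([registers, text_tokens, ref, fill], axis=0)

def joint_attention(x):
    """Single-head self-attention over all tokens (identity projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights
```

Because every token attends to every other token, the fill-in tokens can read from the reference tokens directly. In this sketch the register tokens and the two boundary embeddings are the only signals that tell the model which task is active and which panel a token belongs to.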