RealCustom++: Representing Images as Real Textual Word for Real-Time Customization

Zhendong Mao, Mengqi Huang, Fei Ding, Mingcong Liu, Qian He, Yongdong Zhang

TL;DR

RealCustom++ tackles the inherent conflict between subject similarity and text controllability in text-to-image customization by representing the subject with real words and enforcing subject influence only within a generated guidance mask. It introduces a train-inference decoupled framework comprising a Cross-layer Cross-scale Projector (CCP), a Curriculum Training Recipe (CTR), and Adaptive Mask Guidance (AMG): training learns a general alignment between visual conditions and real words, and inference specializes generation for the target word. The approach enables open-domain, single- and multi-subject customization in real time without per-subject finetuning and demonstrates state-of-the-art gains in controllability and fidelity across SD-v1.5 and SDXL backbones. These contributions advance practical, fine-grained image customization for varied subjects and use cases, including multi-subject scenarios, with robust generalization to pose and size variations.

Abstract

Given a text and an image of a specific subject, text-to-image customization aims to generate new images that align with both the text and the subject's appearance. Existing works follow the pseudo-word paradigm, which represents the subject as a non-existent pseudo word and combines it with other text to generate images. However, the pseudo word causes semantic conflict from its different learning objective and entanglement from overlapping influence scopes with other texts, resulting in a dual-optimum paradox where subject similarity and text controllability cannot be optimal simultaneously. To address this, we propose RealCustom++, a novel real-word paradigm that represents the subject with a non-conflicting real word to first generate a coherent guidance image and the corresponding subject mask, thereby disentangling the influence scopes of the text and subject for simultaneous optimization. Specifically, RealCustom++ introduces a train-inference decoupled framework: (1) during training, it learns a general alignment between visual conditions and all real words in the text; and (2) during inference, a dual-branch architecture is employed, where the Guidance Branch produces the subject guidance mask and the Generation Branch utilizes this mask to customize the generation of the specific real word exclusively within subject-relevant regions. In contrast to previous methods that excel in either controllability or similarity, RealCustom++ achieves superior performance in both, with improvements of 7.48% in controllability, 3.04% in similarity, and 76.43% in generation quality. For multi-subject customization, RealCustom++ further achieves improvements of 4.6% in controllability and 6.34% in multi-subject similarity. Our work has been applied in JiMeng of ByteDance, and the code is released at https://github.com/bytedance/RealCustom.
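The masking idea in the abstract (subject influence restricted to subject-relevant regions, text control everywhere else) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the guidance mask is computed or applied, so the thresholding rule, the function names `binarize` and `masked_blend`, and the toy 2x3 grids are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]


def binarize(relevance: Grid, threshold: float) -> Grid:
    """Turn a soft relevance map into a 0/1 mask (hypothetical thresholding rule)."""
    return [[1.0 if r >= threshold else 0.0 for r in row] for row in relevance]


def masked_blend(text_only: Grid, subject_cond: Grid, mask: Grid) -> Grid:
    """Take the subject-conditioned value inside the mask, the text-only value outside."""
    return [
        [m * s + (1.0 - m) * t for t, s, m in zip(t_row, s_row, m_row)]
        for t_row, s_row, m_row in zip(text_only, subject_cond, mask)
    ]


# Toy 2x3 maps standing in for the two branches' outputs (assumed values).
relevance = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.0]]     # stand-in for a guidance-branch relevance map
text_only = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]     # stand-in for text-controlled features
subject_cond = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]  # stand-in for subject-conditioned features

mask = binarize(relevance, threshold=0.5)
blended = masked_blend(text_only, subject_cond, mask)
print(mask)     # [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(blended)  # [[5.0, 1.0, 1.0], [5.0, 5.0, 1.0]]
```

In this sketch, positions outside the mask keep their text-only values unchanged, which is the disentanglement of influence scopes that the abstract describes; a real system would operate on diffusion features rather than plain lists.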

Paper Structure (21 sections, 17 equations, 18 figures, 13 tables, 1 algorithm)

Figures (18)

  • Figure 1: (a) Existing paradigm represents the subject as a pseudo word ($S^*$) and combines it with the text for generation. The pseudo word inherently conflicts (i.e., causes other real words to deviate from their original semantics) and entangles (i.e., has overlapping influence scope) with the text, resulting in the dual-optimum paradox that involves a trade-off between subject similarity and text controllability. (b) RealCustom++ first represents the subject as real words (e.g., the subject's super-category) to generate a guidance image in the guidance branch, providing the subject guidance mask. Then, in the generation branch, the subject influences only within the mask, while other regions are controlled purely by the text, achieving both high similarity and controllability.
  • Figure 2: The quantitative comparison shows that RealCustom++ simultaneously achieves higher similarity and higher controllability than methods following the existing paradigm.
  • Figure 3: Our RealCustom++ is capable of various customization tasks. (a) One2One: Given a single image depicting the given subject (in the open domain, e.g., humans, cartoons, clothes, buildings), RealCustom++ can synthesize images that are consistent with both the semantics of the texts and the appearance of the subjects, in real time and without any finetuning steps. (b) One2Many: RealCustom++ can decouple and customize each subject in a single reference image. (c) Many2Many: RealCustom++ can customize multiple subjects from multiple reference images. The customized words are highlighted in color.
  • Figure 4: Schematic comparison. We completely redesign the paradigm: the conference version [huang2024realcustom] adopts reconstruction training, which restricts fine-grained image features to avoid overfitting. Our new re-contextualization training introduces references with diverse subject sizes and poses, effectively preventing overfitting and enabling the use of richer image features, leading to simultaneous improvements in both similarity and controllability.
  • Figure 5: Demonstration of the trade-off between subject similarity and text controllability in existing pseudo-word paradigms (illustrated with a representative pseudo-word approach, i.e., Textual Inversion [gal2022image]): increasing the regularization weight reduces subject similarity but improves text controllability, revealing a dual-optimum paradox where both cannot be maximized simultaneously.
  • ...and 13 more figures