CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

Nan Chen; Mengqi Huang; Zhuowei Chen; Yang Zheng; Lei Zhang; Zhendong Mao

TL;DR

CustomContrast tackles the challenge of subject-driven T2I customization by avoiding self-reconstruction pitfalls and employing a cross-differential, contrastive learning framework. It introduces a Multimodal Feature Injection (MFI) Encoder to produce consistent cross-modal representations and a Multilevel Contrastive Learning (MCL) paradigm consisting of Crossmodal Semantic Contrastive Learning (CSCL) and Multiscale Appearance Contrastive Learning (MACL) to extract intrinsic subject attributes from high-level semantics to low-level appearance. The approach defines an $S^+$ space to enable layer-wise control and uses timesteps-specific location guidance to suppress redundant features, yielding improvements in subject similarity and text controllability on SD-V1.5 and SDXL, with substantial gains in ImageReward and CLIP-based controllability metrics. The results suggest a practical impact for robust, fine-grained customization across diverse subjects and editing scenarios, including multi-subject and human-domain generation.

Abstract

Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. The task enables pre-trained models to generate novel images of unique subjects. Existing studies adopt a self-reconstructive perspective that focuses on capturing all details of a single image, and therefore misconstrue that image's irrelevant attributes (e.g., view, pose, and background) as intrinsic attributes of the subject. This misconstruction causes irrelevant and intrinsic attributes to be overfitted or underfitted together, i.e., both kinds of attributes are over-represented or under-represented simultaneously, which creates a trade-off between similarity and controllability. In this study, we argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., by decoupling the subject's intrinsic attributes from irrelevant attributes via contrastive learning. This allows the model to focus on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinct). Specifically, we propose CustomContrast, a novel framework that includes a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm extracts intrinsic features of subjects, from high-level semantics to low-level appearance, through crossmodal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI Encoder to capture cross-modal representations. Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.
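The intra-consistency and inter-distinctiveness objectives described in the abstract can be illustrated with a generic InfoNCE-style loss. The sketch below is only an illustration under stated assumptions: this page does not give the paper's actual loss, so the function names (`cosine`, `info_nce`), the cosine similarity measure, and the temperature value are all hypothetical choices. The loss is low when the anchor is close to a feature of the same subject and far from features of other subjects.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in, not the paper's exact loss).

    anchor    : feature of a subject
    positive  : feature of the same subject (e.g. another view)
    negatives : features of other subjects
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit

# Toy 2-D features (made up for illustration only).
same_subject_close = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
same_subject_far = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
print(same_subject_close, same_subject_far)
```

Minimizing such a loss pulls same-subject features together (intra-consistency) and pushes different-subject features apart (inter-distinctiveness); in the toy example the first configuration yields the smaller loss.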

Paper Structure (20 sections, 9 equations, 7 figures, 4 tables)

Introduction
Related Work
Subject-Driven Text-to-image Customization
Contrastive Learning of Representations
Methodology
Preliminaries
Multimodal Feature Injection Encoder
TV Fusion Module
Multilevel Contrastive Learning Paradigm
Crossmodal Semantic Contrastive Learning
Multiscale Appearance Contrastive Learning
A. $\textit{S}^+$ Space
B. MACL in $\textit{S}^+$ Space
Timesteps-specific Subject Location
Experiments
...and 5 more sections

Figures (7)

Figure 1: Comparison with the existing perspective. (a) Existing studies learn each subject feature entangled with redundant features (e.g., view, pose) and therefore suffer a trade-off between similarity and controllability: because redundant and intrinsic features are coupled, they overfit or underfit together. (b) In contrast, we rethink the problem from a cross-differential perspective. By using contrastive learning to ensure intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinct), our model disentangles the subject's intrinsic features from irrelevant features, optimizing controllability and similarity jointly.
Figure 2: Overview of the proposed CustomContrast. (a) Training pipeline. The consistency between textual and visual features is accurately learned by the MFI-Encoder, which includes a Textual-Visual (TV) Fusion module to enhance feature consistency from visual and textual Qformers. (b) The MCL paradigm includes CSCL, aligning high-level semantics by contrasting visual and textual embeddings via CLS tokens, and MACL, which is applied to text embeddings from different cross-attention layers. MACL decouples redundant subject features by aligning positive samples (segmented images of the same subject from various views, positions, and sizes), while preserving relative distances by contrasting with other subjects.
Figure 3: (a) In $\textit{S}$ space, a token $\boldsymbol{s^*}$ influences all cross-attention layers. (b) In $\textit{S}^+$ space, different $\boldsymbol{s^*_i}$ control cross-attention layers. (c) MACL is applied separately to each $\boldsymbol{s^*_i}$.
Figure 4: Illustration of timesteps-specific subject location.
Figure 5: Qualitative comparison with existing methods. CustomContrast decouples intrinsic features from redundant features, enabling flexible text control over complex pose (e.g., the cat toy in the first row) and shape (e.g., cat driving car in the fourth row) transformations. In contrast, other methods underperform due to the influence of coupled redundant features.
...and 2 more figures
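The Figure 3 caption above distinguishes the $\textit{S}$ space, where one token $\boldsymbol{s^*}$ influences every cross-attention layer, from the $\textit{S}^+$ space, where a separate $\boldsymbol{s^*_i}$ controls each layer and MACL is applied to each $\boldsymbol{s^*_i}$ separately. A minimal sketch of that per-layer arrangement follows. Everything in it is an assumption made for illustration: the helper names, the InfoNCE-style loss, the cosine similarity, and the toy 2-D tokens are hypothetical, and the captions do not say how per-layer losses are combined, so the sketch simply returns one loss per layer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one token (assumed form, not the paper's exact loss)."""
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - pos_logit

def per_layer_losses(anchor_tokens, positive_tokens, other_subject_tokens, temperature=0.1):
    """Apply the contrastive loss to each layer's token s*_i independently.

    anchor_tokens        : list of per-layer tokens for one view of a subject
    positive_tokens      : per-layer tokens for another view of the same subject
    other_subject_tokens : list (one entry per other subject) of per-layer token lists
    """
    losses = []
    for i, (anchor, positive) in enumerate(zip(anchor_tokens, positive_tokens)):
        # Negatives for layer i are the layer-i tokens of the other subjects.
        negatives = [tokens[i] for tokens in other_subject_tokens]
        losses.append(contrastive_loss(anchor, positive, negatives, temperature))
    return losses

# Two hypothetical cross-attention layers with toy 2-D tokens.
anchor = [[1.0, 0.0], [0.0, 1.0]]
positive = [[0.9, 0.1], [1.0, 0.0]]    # aligned at layer 0, misaligned at layer 1
others = [[[0.0, 1.0], [0.0, 1.0]]]    # one other subject
print(per_layer_losses(anchor, positive, others))
```

Keeping one loss per layer makes it visible which layer's token still confuses the subject with others; in the toy data the second layer's loss is much larger than the first.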
