In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee

TL;DR

This work tackles zero-shot customized subject insertion into existing images by reframing the problem as in-context learning within diffusion transformers. It introduces latent feature shifting to transfer subject semantics, plus head-wise attention reweighting and token blending to enhance prompt fidelity and visual coherence, all without model training. The approach demonstrates superior identity preservation, text alignment, and image quality across multiple benchmarks and user studies, with practical applications in virtual try-on, compositional generation, and partial insertions. Overall, In-Context Brush offers a training-free, flexible framework for precise, context-aware subject insertion guided by textual prompts.

Abstract

Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and to align results with the user's intent as expressed through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion that reformulates the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject, in alignment with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head, which dynamically shifts attention outputs to reflect the desired subject semantics, and inter-head "attention reweighting" across different heads, which amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.
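
The dual-level manipulation described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the paper's actual formulas are not reproduced in this section, so the interpolation form of the shift, the prompt-relevance score used to weight heads, and every name here (`shift_and_reweight`, `shift_strength`, `temperature`, `k_ctx`, `k_ref`, `k_txt`) are assumptions chosen only to show the general idea of an intra-head shift followed by an inter-head reweighting.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, computed independently per head.
    # q: (H, Nq, d), k and v: (H, Nk, d) -> output: (H, Nq, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def shift_and_reweight(q, k_ctx, v_ctx, k_ref, v_ref, k_txt,
                       shift_strength=0.5, temperature=1.0):
    # Hypothetical illustration, not the paper's exact formulation.
    # Baseline output: query (masked-region) tokens attend to the target image.
    base = attention(q, k_ctx, v_ctx)
    # Demonstration output: the same queries attend to the reference subject.
    ref = attention(q, k_ref, v_ref)
    # Intra-head shift (assumed form): move each head's output toward
    # the reference-conditioned output by a fixed strength.
    shifted = base + shift_strength * (ref - base)
    # Inter-head reweighting (assumed form): score each head by its mean
    # query-to-prompt similarity, then normalize so the weights average to 1.
    txt_scores = q @ k_txt.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    head_relevance = txt_scores.mean(axis=(1, 2))            # shape (H,)
    head_weights = softmax(head_relevance / temperature) * q.shape[0]
    return shifted * head_weights[:, None, None], head_weights

# Toy usage with random tensors standing in for real MMDiT features.
rng = np.random.default_rng(0)
H, Nq, Nk, Nr, Nt, d = 4, 6, 10, 8, 5, 16
q = rng.normal(size=(H, Nq, d))
k_ctx, v_ctx = rng.normal(size=(H, Nk, d)), rng.normal(size=(H, Nk, d))
k_ref, v_ref = rng.normal(size=(H, Nr, d)), rng.normal(size=(H, Nr, d))
k_txt = rng.normal(size=(H, Nt, d))
out, weights = shift_and_reweight(q, k_ctx, v_ctx, k_ref, v_ref, k_txt)
```

In this sketch, setting `shift_strength` to 0 leaves each head's output unshifted and setting it to 1 replaces it with the reference-conditioned output; a larger `temperature` flattens the head weights toward uniform. Both knobs are illustrative stand-ins for the controls the paper studies (for example, the shift strength ablation in Figure 5).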

Paper Structure

This paper contains 32 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Pipeline of our method. We mainly introduce latent space shifting to present the subject in target images in a training-free manner. In the "Latent Feature Shifting" part, features from the reference are shifted into the output. We propose attention head activation to further enhance the representation of textual prompts, and token blending to inject consistency within the image.
  • Figure 2: Qualitative comparison with baseline methods on subject injection and editing. Our results maintain identity consistency with the reference while preserving fine-grained features, and also align with the prompts. Masks are marked as white boxes on the target images.
  • Figure 3: User study results.
  • Figure 4: Comparisons with two-stage methods.
  • Figure 5: Ablation study on shift strength.
  • ...and 6 more figures