ICAS: IP Adapter and ControlNet-based Attention Structure for Multi-Subject Style Transfer Optimization

Fuwei Liu

TL;DR

ICAS addresses multi-subject style transfer under limited data by decoupling style and structure guidance in a diffusion framework. It combines IP-Adapter for style injection with ControlNet for structural conditioning and employs partial fine-tuning of the content injection branch along with a cyclic content embedding strategy, enabling efficient and faithful multi-subject stylization. The approach achieves superior structure preservation, style coherence, and inference efficiency compared with inversion-based and data-hungry baselines, as demonstrated across extensive experiments and user studies. The work offers a practical pathway for real-world multi-subject stylization with limited annotated data and tight computational budgets.
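
As a rough illustration of the partial fine-tuning idea, the sketch below marks only content-branch parameters as trainable and freezes the rest. The parameter names and the `content_attn` tag are hypothetical and are not taken from the paper; a real checkpoint will use different names. In PyTorch, flags like these would typically be applied with `param.requires_grad_(flag)`.

```python
# Hypothetical parameter names; a real IP-Adapter / U-Net checkpoint will differ.
PARAM_NAMES = [
    "unet.down.0.attn.to_q.weight",
    "ip_adapter.style_attn.to_k.weight",
    "ip_adapter.style_attn.to_v.weight",
    "ip_adapter.content_attn.to_k.weight",
    "ip_adapter.content_attn.to_v.weight",
    "controlnet.input_conv.weight",
]


def select_trainable(names, trainable_tag="content_attn"):
    """Map each parameter name to True (update) or False (freeze).

    The tag is an assumed naming convention for the content injection branch.
    """
    return {name: trainable_tag in name for name in names}


flags = select_trainable(PARAM_NAMES)
trainable = [name for name, flag in flags.items() if flag]
frozen = [name for name, flag in flags.items() if not flag]

print("trainable:", trainable)
print("frozen:", frozen)
```

Selecting by name keeps the pre-trained style path and the base model untouched, which is the property the content-only strategy relies on.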

Abstract

Generating multi-subject stylized images remains a significant challenge due to the ambiguity in defining style attributes (e.g., color, texture, atmosphere, and structure) and the difficulty in consistently applying them across multiple subjects. Although recent diffusion-based text-to-image models have achieved remarkable progress, existing methods typically rely on computationally expensive inversion procedures or large-scale stylized datasets. Moreover, these methods often struggle with maintaining multi-subject semantic fidelity and are limited by high inference costs. To address these limitations, we propose ICAS (IP-Adapter and ControlNet-based Attention Structure), a novel framework for efficient and controllable multi-subject style transfer. Instead of full-model tuning, ICAS adaptively fine-tunes only the content injection branch of a pre-trained diffusion model, thereby preserving identity-specific semantics while enhancing style controllability. By combining IP-Adapter for adaptive style injection with ControlNet for structural conditioning, our framework ensures faithful global layout preservation alongside accurate local style synthesis. Furthermore, ICAS introduces a cyclic multi-subject content embedding mechanism, which enables effective style transfer under limited-data settings without the need for extensive stylized corpora. Extensive experiments show that ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency, establishing a new paradigm for multi-subject style transfer in real-world applications.
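
The abstract does not spell out how the cyclic content embedding mechanism schedules its embeddings. One plausible reading, shown here purely as an assumption, is a round-robin assignment of per-subject embeddings to a fixed number of injection slots (for example cross-attention layers or denoising steps). The function name `cyclic_assign` and the placeholder embedding labels are hypothetical.

```python
from itertools import cycle, islice


def cyclic_assign(content_embeddings, num_slots):
    """Repeat the content embeddings in order until num_slots are filled.

    Assumption: a "slot" is any point where one content embedding is injected.
    """
    if not content_embeddings:
        raise ValueError("need at least one content embedding")
    return list(islice(cycle(content_embeddings), num_slots))


# Placeholder strings stand in for real embedding tensors.
subjects = ["emb_cat", "emb_dog", "emb_bird"]
schedule = cyclic_assign(subjects, 7)

for slot, emb in enumerate(schedule):
    print(slot, emb)
```

Under this reading every subject is revisited at regular intervals, so no single subject's embedding is injected only once.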

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The overall architecture of our proposed ICAS. It combines a pre-trained IP-Adapter, used as the style injection module (SIM), with a ControlNet-based structure preservation module (SPM). Specifically, the style image embedding and the list of content embeddings are injected through different cross-attention paths in the IP-Adapter to fuse multi-subject content features with style features (“SIM”), while the ControlNet branch (“SPM”) receives structural conditions (e.g., edges or depth) to maintain the global layout. We also fine-tune the IP-Adapter on the content branch by training the cross-attention layers of the IP-Adapter path on a small dataset of our own choice. The unified information then flows through the diffusion U-Net to generate the final stylized image, which preserves both subject identity and structural consistency.
  • Figure 2: Comparison with state-of-the-art stylization approaches.
  • Figure 3: The impact of multiple content embeddings on multi-subject fidelity. We compare single embedding (encoding each content image only once) with multi-embedding method (injecting multiple embeddings cyclically). As shown in the figure, the single embedding strategy occasionally occludes or merges some subjects into the background, while the multi-embedding strategy can preserve different appearances for each subject. This confirms that injecting multiple content embeddings can effectively improve the identity preservation rate of each subject in multi-subject scenarios.
  • Figure 4: Training strategies and complexity analysis. We compare three methods: No-Finetune (the pre-trained IP-Adapter with no updates), Full-Finetune (training the style and content blocks simultaneously), and our Content-Only method (freezing the style blocks and updating only num_control_attn). As shown in the figure, No-Finetune and Full-Finetune often suffer from style imbalance or inaccurate subject identification, while the Content-Only strategy retains the pre-trained style knowledge and adapts effectively to multi-subject content. It thereby achieves the best balance between parameter overhead and final image fidelity.
  • Figure 5: Effect of structure scale $\gamma$ on geometry and style. From left to right, we vary $\gamma$ from 0.4 to 0.8 when injecting ControlNet's structural features. Lower values (e.g. 0.4 or 0.5) under-constrain geometry, causing slight subject distortions, while values above 0.6 yield more stable layouts. We empirically find $\gamma = 0.7$ strikes the best balance between structure preservation and rich stylization.
  • ...and 1 more figure
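
The role of the structure scale $\gamma$ in Figure 5 can be sketched as a scaled residual. The captions only say that $\gamma$ is varied when injecting ControlNet's structural features, so the additive form `h + gamma * r` below is an assumption based on how ControlNet conditioning scales commonly work. The function name `inject_structure` and the toy feature values are hypothetical.

```python
def inject_structure(unet_features, control_residual, gamma=0.7):
    """Add the ControlNet residual, scaled by gamma, to the U-Net features.

    Assumed form: h' = h + gamma * r, applied element-wise.
    """
    if len(unet_features) != len(control_residual):
        raise ValueError("feature and residual lengths must match")
    return [h + gamma * r for h, r in zip(unet_features, control_residual)]


# Toy values standing in for feature maps and ControlNet residuals.
features = [1.0, 2.0, 3.0]
residual = [1.0, -1.0, 0.5]

# Sweep the same range of gamma values as in Figure 5.
for gamma in (0.4, 0.5, 0.6, 0.7, 0.8):
    print(gamma, inject_structure(features, residual, gamma))
```

With this form, a smaller $\gamma$ lets the structural signal contribute less to the features, which is consistent with the under-constrained geometry reported for 0.4 and 0.5.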