CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion

Yu Li, Yujun Cai, Chi Zhang

TL;DR

CRAFT-LoRA significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.

Abstract

Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling each element's influence, and unstable weight fusion that often requires additional training. We address these limitations through CRAFT-LoRA, which has three complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.
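
For orientation, the generic building blocks the abstract refers to can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it shows a standard low-rank update, a plain weighted sum of a content and a style update, and ordinary classifier-free guidance whose scale varies with the timestep. The function names, the fusion weights w_c and w_s, and the linear schedule with its endpoints are all hypothetical; the abstract does not specify CRAFT-LoRA's actual fusion rule or guidance schedule.

```python
import numpy as np

def lora_delta(B, A, scale=1.0):
    """Standard LoRA update: delta_W = scale * B @ A, with rank at most B.shape[1]."""
    return scale * (B @ A)

def merge_adapters(W, delta_c, delta_s, w_c=1.0, w_s=1.0):
    """Plain weighted sum of content and style updates (assumed fusion rule)."""
    return W + w_c * delta_c + w_s * delta_s

def guidance_scale(t, num_steps, s_start=7.5, s_end=3.0):
    """Hypothetical linear schedule from s_start at step 0 to s_end at the last step."""
    frac = t / max(num_steps - 1, 1)
    return s_start + (s_end - s_start) * frac

def cfg(eps_uncond, eps_cond, scale):
    """Ordinary classifier-free guidance on two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
d, r = 8, 2                                   # toy layer width and adapter rank
W = rng.normal(size=(d, d))                   # frozen base weight
delta_c = lora_delta(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
delta_s = lora_delta(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
W_merged = merge_adapters(W, delta_c, delta_s, w_c=0.8, w_s=0.6)

eps_u, eps_c = rng.normal(size=4), rng.normal(size=4)
guided = [cfg(eps_u, eps_c, guidance_scale(t, 50)) for t in range(50)]
```

Because the guidance scale is only a function of the step index, a schedule of this kind needs no extra training, which is the sense in which such a scheme is training-free.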

Paper Structure (25 sections, 22 equations, 9 figures, 10 tables, 2 algorithms)

Figures (9)

  • Figure 1: Samples generated by our proposed framework. Our method achieves more effective decoupling and fusion of content and style, enabling finer control over both aspects during generation.
  • Figure 2: Overview of CRAFT-LoRA, a unified pipeline for personalized image synthesis. In the Training Stage, content ($\Delta W_c$) and style ($\Delta W_s$) LoRA adapters are decoupled from reference images using a rank-restricted initialization offset. The Prompt Text Guidance module employs an expert system with specialized branches to produce distinct content and style tokens for semantic control. In the Inference Stage, a timestep-aware asymmetric CFG scheme selectively integrates LoRA updates, ensuring stable and high-fidelity image generation.
  • Figure 3: Content-style separation via frequency domain decomposition. Each group shows an original image with its frequency-separated elements and additional samples from the same group. Left: content group, where low-frequency components capture structural and semantic information. Right: style group, where high-frequency components capture textures and artistic rendering.
  • Figure 4: Visual comparison of content-style combinations. The figure presents a systematic evaluation of different methods in combining specific content elements with distinct artistic styles. Each column showcases the content and style reference followed by the outputs generated. Prompts follow the format "A [content] <c> in [style] <s>". Competing methods often fail to simultaneously preserve structure and render style, whereas our method produces consistent and coherent content–style compositions.
  • Figure 5: Extended visual results. (a) Content–style generations augmented with additional prompt descriptions, e.g., "catching a frisbee", "wearing a hat", and "driving a car". Our method flexibly integrates dynamic semantics while preserving the specified content and style. (b) Single-branch generations where only the content LoRA or only the style LoRA is activated. Content-only generations preserve object identity across neutral renderings, while style-only generations transfer artistic features across different subjects.
  • ...and 4 more figures
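
The frequency-domain separation described in the Figure 3 caption can be illustrated with a simple low-pass/high-pass split. This is a minimal sketch assuming a grayscale image and a hard circular cutoff in the Fourier domain; the function name and the cutoff value are hypothetical, and the paper's actual decomposition may differ.

```python
import numpy as np

def frequency_split(image, cutoff=0.1):
    """Split a 2D image into low- and high-frequency parts.

    cutoff is a radius in cycles per sample (hypothetical default).
    """
    fy = np.fft.fftfreq(image.shape[0])[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(image.shape[1])[None, :]   # horizontal frequencies
    mask = np.sqrt(fx**2 + fy**2) <= cutoff        # keep only low frequencies
    spectrum = np.fft.fft2(image)
    low = np.real(np.fft.ifft2(spectrum * mask))   # coarse structure
    high = image - low                             # fine texture and edges
    return low, high

rng = np.random.default_rng(0)
image = rng.random((32, 32))
low, high = frequency_split(image)
```

By construction the two parts sum back to the input, so the split loses no information; it only assigns each frequency band to one of the two components.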