A Training-Free Style-Personalization via SVD-Based Feature Decomposition

Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im

TL;DR

This work tackles fast, training-free style personalization for text- and image-guided generation by analyzing a scale-wise autoregressive backbone (Infinity). It identifies a pivotal early step where the dominant singular values of an internal feature capture style and introduces two lightweight modules—Principal Feature Blending and Structural Attention Correction—to inject style and stabilize structure without training. Through extensive experiments, the approach achieves competitive style and prompt fidelity while significantly reducing inference time compared to fine-tuned baselines, and it generalizes across model scales. The proposed method offers practical benefits for real-time, user-friendly style personalization with broad applicability to scale-wise autoregressive generation frameworks.

Abstract

We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
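
To make the idea of SVD-based feature reconstruction concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes features are 2-D arrays of shape (tokens, channels). It also assumes that "blending" means interpolating the top-k singular values of the generation feature toward those of the style feature, while keeping the generation feature's singular vectors. The function name `blend_principal_features` and the parameters `k` and `alpha` are hypothetical and do not come from the paper.

```python
import numpy as np


def blend_principal_features(gen_feat, style_feat, k=1, alpha=1.0):
    """Blend the top-k singular values of style_feat into gen_feat.

    Illustrative assumption only: the paper's exact blending rule may differ.
    gen_feat, style_feat: arrays of shape (tokens, channels).
    k: number of dominant singular values to modify (hypothetical parameter).
    alpha: blend weight; 0 keeps gen_feat, 1 fully adopts the style values.
    """
    # Thin SVD of the generation-path feature: gen_feat = U @ diag(S) @ Vt.
    u_gen, s_gen, vt_gen = np.linalg.svd(gen_feat, full_matrices=False)
    # Only the singular values of the style feature are needed here.
    s_style = np.linalg.svd(style_feat, compute_uv=False)

    # Interpolate the dominant singular values, leave the rest untouched.
    s_new = s_gen.copy()
    s_new[:k] = (1.0 - alpha) * s_gen[:k] + alpha * s_style[:k]

    # Reconstruct with the generation feature's own singular vectors.
    return (u_gen * s_new) @ vt_gen
```

Under these assumptions, `alpha = 0` returns the generation feature unchanged, and larger `alpha` moves only the dominant components toward the style reference. The remaining singular values, and with them the lower-energy structure of the feature, stay as they were.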

Paper Structure

This paper contains 38 sections, 10 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Style-personalized image generation results produced by our method. Given reference style images and text prompts, our method generates images with consistent style and diverse content.
  • Figure 2: Step-wise prompt injection analysis. For each generation step $\hat{s} \in \{1, \dots, 12\}$, we replace the base prompt with the alternative prompt $\hat{T}$ at step $\hat{s}$ only, keeping all other steps fixed to the base prompt. Top: style prompt injection ("A photo of a black teddy bear" vs. "A photo of a white teddy bear"). Middle: content prompt injection ("A photo of a cupcake" vs. "A photo of a donut"). Bottom: CLIP similarity between the alternative prompt $\hat{T}$ and the corresponding image across steps.
  • Figure 3: Key step feature analysis. Content and style similarity are measured for Baseline, Full replacement, and SVD-guided outputs using a set of prompt pairs $\mathbf{T}$, with results averaged across all pairs.
  • Figure 4: Overall pipeline of our model. The text encoder processes an identical text prompt $T$ for both the content and generation paths, providing their embeddings to the autoregressive transformer. At stage $s=3$, Principal Feature Blending is applied to extract the principal style representation from the reference style image and integrate it into the features of the generation path. Starting from $s=3$ (the fine stage), Structural Attention Correction aligns the generation path’s attention maps with those of the content path, ensuring stable and consistent structural guidance during refinement.
  • Figure 5: Qualitative comparison with state-of-the-art style-personalized image generation models.
  • ...and 9 more figures
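
The step-wise intervention described in the Figure 2 caption can be sketched as a per-step prompt schedule. This is a minimal illustration under stated assumptions, not the authors' code: the helper name `prompt_schedule` is hypothetical, the choice of which prompt in each pair is the base prompt is an assumption, and the generation and CLIP-scoring steps are omitted.

```python
def prompt_schedule(base_prompt, alt_prompt, inject_step, num_steps=12):
    """Return the prompt used at each generation step (1-indexed).

    Every step uses base_prompt except inject_step, which uses alt_prompt.
    num_steps=12 follows the step range given in the Figure 2 caption.
    """
    if not 1 <= inject_step <= num_steps:
        raise ValueError("inject_step must lie in [1, num_steps]")
    return [
        alt_prompt if step == inject_step else base_prompt
        for step in range(1, num_steps + 1)
    ]


# One schedule per candidate step; each would drive one generation run.
# Which prompt is the base one is an assumption made for this example.
schedules = {
    step: prompt_schedule(
        "A photo of a black teddy bear",
        "A photo of a white teddy bear",
        inject_step=step,
    )
    for step in range(1, 13)
}
```

Each schedule differs from the base run at exactly one step. Comparing each resulting image against the alternative prompt therefore isolates how much that single step contributes to style or content.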