CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

Quang-Binh Nguyen, Minh Luu, Quang Nguyen, Anh Tran, Khoi Nguyen

Abstract

Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representations with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an Augmented Key-Value (K-V) memory that enhances content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
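
The abstract does not say how the SVD-based rectification in item (2) is computed. As a rough illustration only, one plausible reading is that directions dominated by content are estimated with an SVD and then projected out of the style embedding. The sketch below assumes exactly that; the function name `rectify_style_embedding`, the `content_embs` matrix, and the `rank` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def rectify_style_embedding(style_emb, content_embs, rank):
    """Remove from style_emb its component in an estimated content subspace.

    Assumption (not from the paper): the rows of content_embs are embedding
    vectors that mostly encode content, and their top `rank` right singular
    vectors span the subspace to suppress.
    """
    # Right singular vectors give an orthonormal basis for the row space.
    _, _, vt = np.linalg.svd(content_embs, full_matrices=False)
    basis = vt[:rank]  # shape (rank, d), orthonormal rows
    # Orthogonal projection of the style embedding onto the content subspace.
    content_part = basis.T @ (basis @ style_emb)
    return style_emb - content_part


# Toy data standing in for real text embeddings.
rng = np.random.default_rng(0)
content_embs = rng.normal(size=(8, 16))
style_emb = rng.normal(size=16)
rectified = rectify_style_embedding(style_emb, content_embs, rank=4)
```

After this projection, the rectified vector is orthogonal to the estimated content directions, which is the sense in which content "leakage" is reduced in this toy setting. The actual method may select or weight directions differently.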

Paper Structure

This paper contains 19 sections, 10 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Given a single input image, our framework separates content and style, enabling flexible recontextualization and stylization to generate new images across diverse contexts.
  • Figure 2: Overview of our CSD-VAR. During optimization (left), a content-style prompt $\mathbf{y}$, "A photo of a <$y_c$> object in <$y_s$> style", is encoded into text embeddings $\mathbf{e}$. The rectified style embedding $e_s$ reduces content leakage, while ground-truth scale-wise tokens from VQ-VAE are interpolated for next-scale prediction. Augmented K-V memories are prepended at specific scales before feeding into the autoregressive transformer. The model is trained with scale-wise cross-entropy losses, with optimization alternating between the content and style embeddings. At inference (right), style or content K-V memories are prepended based on the prompt before predicting tokens.
  • Figure 3: Analysis of style-related scores across different scales.
  • Figure 4: Style embedding rectification and examples.
  • Figure 5: Statistics and samples of the CSD-100 dataset.
  • ...and 9 more figures