All-in-One Slider for Attribute Manipulation in Diffusion Models

Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang, Xuecheng Nie

Abstract

Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating specific attributes of generated images to match user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies address this by training slider modules, but they follow a One-for-One approach: an independent slider is trained for each attribute, so every new attribute requires additional training. This leads to parameter redundancy as sliders accumulate, and it limits both the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable, fine-grained, continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports the composition of multiple attributes and zero-shot manipulation of unseen attributes (e.g., races and celebrities). Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements over previous methods. Furthermore, our method can be integrated with an inversion framework to manipulate attributes of real images, broadening its applicability to real-world scenarios. The code is available on our project page.

Paper Structure

This paper contains 30 sections, 6 equations, 19 figures, 3 tables, 1 algorithm.

Figures (19)

  • Figure 1: Our All-in-One Slider shows advantages in: (1) Fine-grained and continuous control over the desired attribute, without affecting other attributes (e.g., subject identity and appearance). (2) Combination of multiple facial attributes (e.g., smile and age) for consistent and conflict-free transformations. (3) Zero-shot generalization to unseen attributes, without repeated and cumbersome training processes.
  • Figure 2: (1) Existing One-for-One slider methods require training a specific slider module for each attribute. (2) Our All-in-One slider needs to be trained only once to obtain a unified representation for different attributes, supporting the flexible manipulation of multiple diverse attributes.
  • Figure 3: An overview of our All-in-One slider's framework. Stage 1: Unsupervised training of the Attribute Sparse Autoencoder, which takes a token embedding obtained from the residual stream of the text encoder as input and aims to reconstruct it from sparse features. Stage 2: Applying the trained Attribute Sparse Autoencoder to flexibly manipulate specific attributes during the image generation process.
  • Figure 4: Qualitative results of face attribute manipulation. Our All-in-One slider can perform both fine-grained semantic edits (e.g., smile, makeup, and age) and physical changes (e.g., eyeglasses, hat, hair style, and skin tone).
  • Figure 5: Qualitative results of compositional multi-attribute manipulation. Our All-in-One slider achieves coherent manipulation while preserving the original identity.
  • ...and 14 more figures
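
The two stages summarized in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the embedding width, dictionary size, ReLU encoder, top-k sparsity rule, and the idea of sliding by adding a scaled decoder row are all assumptions made for illustration, and the weights here are random rather than trained. It only shows the general shape of a sparse autoencoder over token embeddings and how learned directions might be recombined to shift an embedding along one or several attributes.

```python
import numpy as np

# All sizes below are hypothetical; the source does not state them.
D_MODEL = 16      # assumed token-embedding width
N_FEATURES = 64   # assumed (overcomplete) number of sparse features
TOP_K = 4         # assumed number of active features per token


def init_sae(d_model, n_features, rng):
    """Randomly initialise an untrained sparse autoencoder (illustration only)."""
    w_enc = rng.normal(scale=0.1, size=(d_model, n_features))
    w_dec = rng.normal(scale=0.1, size=(n_features, d_model))
    # Normalise decoder rows so each feature direction has unit length.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {"w_enc": w_enc, "b_enc": np.zeros(n_features),
            "w_dec": w_dec, "b_dec": np.zeros(d_model)}


def encode(sae, x, k=TOP_K):
    """ReLU encoder followed by top-k sparsification (an assumed sparsity scheme)."""
    pre = np.maximum((x - sae["b_dec"]) @ sae["w_enc"] + sae["b_enc"], 0.0)
    # Indices of the k largest activations for each token.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return z


def decode(sae, z):
    """Reconstruct token embeddings from sparse features."""
    return z @ sae["w_dec"] + sae["b_dec"]


def reconstruction_loss(sae, x):
    """Stage 1 objective in its simplest form: mean squared reconstruction error."""
    return float(np.mean((decode(sae, encode(sae, x)) - x) ** 2))


def slide(sae, x, feature_scales):
    """Stage 2 sketch: shift embeddings along chosen decoder directions.

    feature_scales maps a feature index to a slider strength; several
    entries compose several attributes in a single shift.
    """
    delta = np.zeros(x.shape[-1])
    for feature_idx, scale in feature_scales.items():
        delta += scale * sae["w_dec"][feature_idx]
    return x + delta


# Usage with random data standing in for text-encoder token embeddings.
rng = np.random.default_rng(0)
sae = init_sae(D_MODEL, N_FEATURES, rng)
tokens = rng.normal(size=(5, D_MODEL))
loss = reconstruction_loss(sae, tokens)               # would be minimised in Stage 1
edited = slide(sae, tokens, {3: 2.0, 5: -0.5})        # hypothetical feature indices
```

In this sketch the feature indices 3 and 5 are placeholders; in practice one would first have to identify which sparse features correspond to which attributes. Because the shift is a plain sum of scaled directions, composing two attributes in one call gives the same result as applying them one after the other.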