AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models

Die Chen, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

TL;DR

AttriCtrl tackles the challenge of fine-grained, continuous aesthetic attribute control in diffusion-based image synthesis by quantifying attributes on a unified $[0,1]$ scale and introducing a lightweight value encoder that injects learnable token sequences into the conditioning of a frozen diffusion backbone. It combines direct metrics (brightness, detail) with CLIP-based realism and safety proxies to obtain interpretable attribute scores, which are then normalized and fed through a modular encoder to achieve disentangled, attribute-specific control. The approach is validated on single- and multi-attribute scenarios, outperforming baselines in control accuracy and safety suppression, and demonstrates seamless compatibility with ControlNet and related frameworks. Overall, AttriCtrl enables precise, compositional aesthetic manipulation with minimal model modification, paving the way for mixing-console–style, plug-and-play control in diffusion-based generation and potential generalization to a broader class of semantic attributes.

Abstract

Diffusion models have recently become the dominant paradigm for image generation, yet existing systems struggle to interpret and follow numeric instructions for adjusting semantic attributes. In real-world creative scenarios, especially when precise control over aesthetic attributes is required, current methods fail to provide such controllability. This limitation partly arises from the subjective and context-dependent nature of aesthetic judgments, but more fundamentally stems from the fact that current text encoders are designed for discrete tokens rather than continuous values. Meanwhile, efforts on aesthetic alignment, often leveraging reinforcement learning, direct preference optimization, or architectural modifications, primarily align models with a global notion of human preference. While these approaches improve user experience, they overlook the multifaceted and compositional nature of aesthetics, underscoring the need for explicit disentanglement and independent control of aesthetic attributes. To address this gap, we introduce AttriCtrl, a lightweight framework for continuous aesthetic intensity control in diffusion models. It first defines relevant aesthetic attributes, then quantifies them through a hybrid strategy that maps both concrete and abstract dimensions onto a unified $[0,1]$ scale. A plug-and-play value encoder is then used to transform user-specified values into model-interpretable embeddings for controllable generation. Experiments show that AttriCtrl achieves accurate and continuous control over both single and multiple aesthetic attributes, significantly enhancing personalization and diversity. Crucially, it is implemented as a lightweight adapter while keeping the diffusion model frozen, ensuring seamless integration with existing frameworks such as ControlNet at negligible computational cost.
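
The quantification step described above (a direct metric mapped onto a unified $[0,1]$ scale) can be illustrated with a minimal sketch. The abstract does not specify the exact metric or normalization, so everything here is an assumption for illustration: brightness is taken as the mean Rec. 601 luma, the mapping is a clipped min-max rescaling, and the names `brightness_raw` and `to_unit_scale` are hypothetical.

```python
def brightness_raw(pixels):
    """Mean Rec. 601 luma of an iterable of (R, G, B) tuples with 0-255 channels.

    Assumption: the paper's brightness metric may be defined differently.
    """
    total = 0.0
    count = 0
    for r, g, b in pixels:
        total += 0.299 * r + 0.587 * g + 0.114 * b
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return total / count


def to_unit_scale(raw, lo, hi):
    """Min-max map a raw score to [0, 1], clipping values outside [lo, hi].

    Assumption: lo and hi would come from dataset statistics in practice.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (raw - lo) / (hi - lo)))


# Toy four-pixel "image": black, white, mid-gray, pure red.
image = [(0, 0, 0), (255, 255, 255), (128, 128, 128), (255, 0, 0)]
raw = brightness_raw(image)
intensity = to_unit_scale(raw, lo=0.0, hi=255.0)
```

Abstract attributes such as realism or safety would need a learned proxy (the TL;DR mentions CLIP-based ones) in place of `brightness_raw`, but the same rescaling step could then place all attributes on a common scale.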

Paper Structure

This paper contains 20 sections, 9 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Overview. Methods such as 'Add to Prompt' and 'Control with Kontext' fail to establish stable or reliable attribute control. In contrast, our proposed AttriCtrl enables fine-grained control over aesthetic attributes by modulating their intensity in the generated image.
  • Figure 2: Examples of aesthetic attribute intensities in the training dataset. We show the raw values computed via quantitative metrics and the normalized values after value mapping, scaled to $[0,1]$.
  • Figure 3: Framework. We train a value encoder that maps a normalized attribute intensity value to multi-scale representations, which are concatenated with text prompts and injected into the DiT.
  • Figure 4: Qualitative comparison of different control methods. Given a target attribute intensity value, we visualize the absolute difference (Diff $\downarrow$) between the generated images and the target.
  • Figure 5: Performance comparison on the I2P dataset. AttriCtrl achieves a total removal rate (RR $\uparrow$) of 57.7%, outperforming all baselines, including ESD (53.9%), SLD (32.6%) and NP (11.6%).
  • ...and 8 more figures
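
The framework summarized in the Figure 3 caption (a value encoder whose output tokens are concatenated with the text prompt tokens) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' architecture: the sinusoidal features standing in for "multi-scale representations", the per-token linear projections, the token count, and the names `ToyValueEncoder` and `build_conditioning` are all hypothetical.

```python
import math
import random


def sinusoidal_features(value, num_freqs=4):
    """Encode a scalar in [0, 1] as sin/cos features at several frequencies."""
    feats = []
    for k in range(num_freqs):
        angle = (2 ** k) * math.pi * value
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


class ToyValueEncoder:
    """Maps one intensity value to `num_tokens` embeddings of size `dim`."""

    def __init__(self, num_tokens=2, dim=8, num_freqs=4, seed=0):
        rng = random.Random(seed)
        self.num_freqs = num_freqs
        in_dim = 2 * num_freqs
        # One weight matrix per output token; in a real adapter these would be
        # the only trainable parameters, with the diffusion backbone frozen.
        self.weights = [
            [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(dim)]
            for _ in range(num_tokens)
        ]

    def __call__(self, value):
        if not 0.0 <= value <= 1.0:
            raise ValueError("value must lie in [0, 1]")
        feats = sinusoidal_features(value, self.num_freqs)
        return [
            [sum(w * f for w, f in zip(row, feats)) for row in token_w]
            for token_w in self.weights
        ]


def build_conditioning(text_tokens, value_tokens):
    """Append value tokens to the text tokens along the sequence axis."""
    return list(text_tokens) + list(value_tokens)


encoder = ToyValueEncoder(num_tokens=2, dim=8)
text_tokens = [[0.0] * 8 for _ in range(5)]  # stand-in for text-encoder output
conditioning = build_conditioning(text_tokens, encoder(0.7))
```

Here `conditioning` holds five text tokens followed by two value tokens, which is the sequence a frozen backbone would receive. Controlling several attributes at once would presumably append one such token group per attribute.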