AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

Xuanhua Yin, Chuanzhi Xu, Haoxian Zhou, Boyu Wei, Weidong Cai

Abstract

Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions evolve smoothly with redundant computation. Based on this insight, we propose AccelAes, a training-free framework that accelerates DiTs through aesthetics-aware spatio-temporal reduction while improving perceptual aesthetics. AccelAes builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals. When localized computation is feasible, SkipSparse reallocates computation and guidance to masked regions. We further reduce temporal redundancy using a lightweight step-level prediction cache that periodically replaces full Transformer evaluations. Experiments on representative DiT families show consistent acceleration and improved aesthetics-oriented quality. On Lumina-Next, AccelAes achieves a 2.11$\times$ speedup and improves ImageReward by +11.9% over the dense baseline. Code is available at https://github.com/xuanhuayin/AccelAes.
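
As a rough illustration of the step-level prediction cache mentioned in the abstract, the sketch below runs the full model only on some steps and fills the remaining steps by extrapolating from the two most recent full predictions. The class name `StepCache`, the `refresh_every` schedule, and the stand-in `toy_model` are assumptions made for this example; they are not taken from the AccelAes code.

```python
import numpy as np


class StepCache:
    """Hypothetical step-level cache: full forward every `refresh_every` steps,
    linear extrapolation from the last two full predictions otherwise."""

    def __init__(self, refresh_every):
        self.refresh_every = refresh_every
        self.history = []  # up to two (step, prediction) pairs from full forwards

    def predict(self, step, latents, model_fn):
        need_full = len(self.history) < 2 or step % self.refresh_every == 0
        if need_full:
            pred = model_fn(latents, step)
            # Keep only the two most recent full predictions.
            self.history = (self.history + [(step, pred)])[-2:]
            return pred
        (s0, p0), (s1, p1) = self.history
        slope = (p1 - p0) / (s1 - s0)
        # Extrapolate along the line through the two cached predictions.
        return p1 + slope * (step - s1)


calls = []


def toy_model(latents, step):
    # Stand-in for a Transformer forward pass; its output is linear in the step index.
    calls.append(step)
    return np.zeros_like(latents) + 2.0 * step


cache = StepCache(refresh_every=3)
latents = np.zeros((2, 2))
outputs = [cache.predict(s, latents, toy_model) for s in range(6)]
```

Because the toy model is exactly linear in the step index, the extrapolated steps reproduce the full predictions here; a real denoiser would only be approximated, which is why a cache of this kind needs periodic full evaluations.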

Paper Structure (51 sections, 13 equations, 12 figures, 13 tables)

This paper contains 51 sections, 13 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: AccelAes improves both efficiency and overall quality. Left: Qualitative comparison under identical prompts, with zoomed-in crops and per-sample ImageReward (IR). Right: Speed-quality summary, with a multi-metric radar chart (bottom right) on Lumina that benchmarks AccelAes against the dense baseline and representative training-free acceleration methods and shows consistent gains across metrics.
  • Figure 2: The proposed AccelAes framework. (a) AccelAes builds AesMask from prompt semantics and cross-attention. It uses the mask for spatially adaptive computation and region-wise guidance. SkipSparse also includes a step-level cache with linear extrapolation to reduce redundant forwards across timesteps. (b) AesMask uses aesthetic anchors, CLIP similarity, and cross-attention aggregation, then applies percentile thresholding to obtain a binary mask. (c) SkipSparse updates attention and FFN mainly on aesthetic-focus tokens. It keeps global context with global keys/values and reuses predictions across steps via the cache.
  • Figure 3: Baseline comparisons on SD3 and FLUX. We compare AccelAes with the dense baseline and prior acceleration methods on SD3 and FLUX using the same prompts, seeds, and measurement protocol. AccelAes achieves a better speed-quality trade-off and improves aesthetics-oriented scores while reducing sampling time.
  • Figure 4: AesMask analysis and ablation. We study how AesMask is constructed and how it affects acceleration and quality. We ablate key design choices such as anchor selection, cross-attention aggregation, and mask thresholding, and we report their impact on both runtime and aesthetic metrics.
  • Figure 5: SkipSparse ablation. We evaluate variants that enable spatial allocation, step-level reuse, or both, and report speedup and ImageReward with and without AesMask.
  • ...and 7 more figures
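
The Figure 2 caption describes AesMask as aggregating cross-attention under aesthetic-anchor similarity and then applying percentile thresholding to obtain a binary mask. A minimal sketch of that idea follows; the function name `aes_mask_sketch`, the weighting rule (clipped similarity used as a per-token weight), and the `keep_percent` parameter are assumptions for illustration and may differ from the paper's exact formulation.

```python
import numpy as np


def aes_mask_sketch(cross_attn, anchor_sim, keep_percent):
    """Hypothetical AesMask-style mask.

    cross_attn:   (num_prompt_tokens, num_spatial_tokens) attention weights
    anchor_sim:   (num_prompt_tokens,) similarity of each prompt token to
                  aesthetic anchors (e.g. from CLIP); assumed precomputed
    keep_percent: approximate percentage of spatial tokens to keep
    """
    # Ignore prompt tokens with negative affinity to the aesthetic anchors.
    weights = np.clip(anchor_sim, 0.0, None)
    # Aggregate cross-attention over prompt tokens into one score per spatial token.
    scores = weights @ cross_attn
    # Percentile thresholding: keep tokens at or above the cutoff score.
    threshold = np.percentile(scores, 100.0 - keep_percent)
    return scores >= threshold


cross_attn = np.array([[0.1, 0.2, 0.3, 0.4],
                       [0.4, 0.3, 0.2, 0.1]])
anchor_sim = np.array([1.0, 0.0])  # only the first prompt token is "aesthetic"
mask = aes_mask_sketch(cross_attn, anchor_sim, keep_percent=50.0)
```

In this toy case only the first prompt token carries aesthetic weight, so the mask selects the two spatial tokens that attend to it most strongly.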