Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation

Jiaye Li, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu

TL;DR

HARoPE addresses the limitations of multi-dimensional RoPE in fine-grained image generation by introducing a head-wise adaptive, SVD-parameterized linear pre-transform before the rotary mapping. This design enables dynamic frequency reallocation, semantic alignment of rotary planes, cross-axis interactions, and head-specific positional receptive fields while preserving RoPE's relative-offset property. Empirical results across image understanding, class-conditioned image generation, and text-to-image synthesis show consistent improvements over strong RoPE baselines and other extensions, validating HARoPE as an effective drop-in enhancement for transformer-based vision generation. The approach offers a principled way to inject adaptive, per-head positional reasoning into large generative models, improving spatial discrimination, color fidelity, and object counting in complex prompts.

Abstract

Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation, parameterized via singular value decomposition (SVD), before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
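
The core idea described above (a per-head linear map applied before the rotary mapping) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `rope_rotate`, `make_head_transform`, and `attention_score` are hypothetical, the example uses 1D positions rather than image axes, and building the orthogonal factors with QR and the singular values with an exponential is an assumption made only to get a valid matrix of the form U diag(s) V^T. The point it shows is that applying the same matrix A to both query and key before RoPE keeps the score dependent only on the position offset, since (R_m A q)^T (R_n A k) = q^T A^T R_{n-m} A k.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate consecutive feature pairs of x by pos * frequency."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D rotary plane
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def make_head_transform(rng, d):
    """Hypothetical per-head matrix A = U diag(s) V^T (assumed parameterization)."""
    u, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal factor U
    v, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal factor V
    s = np.exp(0.1 * rng.standard_normal(d))           # positive singular values
    return u @ np.diag(s) @ v.T

def attention_score(q, k, a, pos_q, pos_k):
    """Pre-transform query and key with the same A, then apply RoPE and take the dot product."""
    return rope_rotate(a @ q, pos_q) @ rope_rotate(a @ k, pos_k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, num_heads = 8, 2
    q, k = rng.standard_normal(d), rng.standard_normal(d)
    for h in range(num_heads):
        a = make_head_transform(rng, d)                 # each head gets its own matrix
        near = attention_score(q, k, a, 3, 7)
        shifted = attention_score(q, k, a, 13, 17)      # same offset, shifted positions
        print(f"head {h}: shift-invariant = {np.isclose(near, shifted)}")
```

Because each head draws its own matrix, the rotary planes see a different mix of the input features in every head, which is one way to read the "head-specific" part of the description; how the actual method learns and constrains these matrices is not specified in this excerpt.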

Paper Structure

This paper contains 38 sections, 10 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Qualitative comparison of generated images across three fine-grained challenges: spatial relations (left), color fidelity (middle), and object counting (right). HARoPE consistently outperforms RoPE, adhering more faithfully to prompt specifications (instruction keywords highlighted in red).
  • Figure 2: Qualitative comparison on wild prompts, evaluating FLUX models with RoPE and HARoPE positional embeddings.
  • Figure 3: Qualitative comparison of different matrix settings during inference. "NM" denotes normal matrix and "OM" denotes orthogonal matrix.
  • Figure 4: Heatmaps of the learned matrix weights across different attention heads and blocks.
  • Figure 5: Performance comparison of RoPE + SVD and RoPE + SVD + Multi-head on the GenEval benchmark (FLUX model, 1024$\times$1024 resolution).
  • ...and 4 more figures