One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le

TL;DR

This work identifies a fundamental failure mode of RoPE-based attention in diffusion transformers when running mixed-resolution denoising, caused by cross-rate phase aliasing from naive interpolation. It introduces Cross-Resolution Phase-Aligned Attention (CRPA), a training-free, drop-in mechanism that reindexes RoPE phases onto the query grid so equal physical distances produce identical phase increments, thereby stabilizing all heads across resolutions. An optional Boundary Expand-and-Replace step further harmonizes textures around resolution boundaries. Together, CRPA and boundary expansion enable stable, high-fidelity image and video generation with mixed-resolution diffusion transformers and offer practical gains in efficiency by focusing high resolution where it matters most.
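
The phase argument can be written out as a short worked formula. This is a sketch under assumed notation, not the paper's own derivation: $x$ is a token's physical coordinate, $s_q$ is the stride of the query's grid, $p$ is the index fed to RoPE, $\omega_i$ is the $i$-th RoPE frequency, and $\tilde q_i, \tilde k_i$ are the query and key channel pairs viewed as complex numbers. The first line is the standard RoPE identity; the second and third show what reindexing onto the query grid implies.

```latex
% Assumed notation for illustration; symbols are not taken from the paper.
\begin{align}
  \langle R_{p_q} q,\; R_{p_k} k \rangle
    &= \mathrm{Re} \sum_i \tilde q_i \, \overline{\tilde k_i}\,
       e^{\mathrm{i}\,\omega_i (p_q - p_k)}
    && \text{(score depends only on } p_q - p_k\text{)} \\
  p_q &= \frac{x_q}{s_q}, \qquad p_k = \frac{x_k}{s_q}
    && \text{(all positions on the query's stride)} \\
  \omega_i (p_q - p_k) &= \omega_i \, \frac{x_q - x_k}{s_q}
    && \text{(phase is a function of physical distance alone)}
\end{align}
```

Under this reading, two keys at the same physical distance from a query receive the same phase offset whichever grid they came from, and a query's nearest same-grid neighbour sits at index offset 1.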

Abstract

We identify a core failure mode that occurs when the usual linear interpolation is applied to rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers (DiTs). When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural: linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle: many heads exhibit extremely sharp, periodic phase selectivity, so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. Our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free, drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs and stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.
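
The index remapping described in the abstract can be illustrated with a minimal one-dimensional NumPy sketch. It is not the authors' implementation: the helper names `rope_angles` and `crpa_positions`, the use of physical coordinates in high-resolution pixel units, and the toy values are all assumptions made for illustration.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0):
    """Standard RoPE angles: one row per position, one column per frequency."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    positions = np.asarray(positions, dtype=np.float64)
    return positions[:, None] * inv_freq[None, :]

def crpa_positions(query_coord, query_stride, key_coords):
    """Hypothetical helper: express the query and all keys on the query's stride.

    Coordinates are physical positions; query_stride is the physical spacing
    between neighbouring tokens on the query's grid (assumed convention).
    """
    q_pos = query_coord / query_stride
    k_pos = np.asarray(key_coords, dtype=np.float64) / query_stride
    return q_pos, k_pos

# Toy case: a low-resolution query (stride 2) at coordinate 6 attends to
# a high-resolution key at 2, a low-resolution key at 10, and its
# low-resolution neighbour at 8.
q_pos, k_pos = crpa_positions(6.0, 2.0, [2.0, 10.0, 8.0])
relative_index = k_pos - q_pos            # offsets on the query's grid
phases = rope_angles(relative_index)      # phase offset per frequency
```

In this toy case the keys at coordinates 2 and 10 are both 4 physical units from the query and receive phase offsets of equal magnitude, while the same-grid neighbour at coordinate 8 lands at index offset 1.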

Paper Structure

This paper contains 41 sections, 27 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Mixed-Resolution Denoising. (a) Naïve mixed-resolution denoising, with the high-resolution region outlined in red, collapses due to RoPE [su2024roformer] phase mismatches between resolutions, producing severe blur and instability. (b) Our Cross-Resolution Phase-Aligned Attention (CRPA) keeps RoPE phases synchronized across scales, restoring sharp and consistent detail.
  • Figure 2: Results for RoPE with linear position interpolation (PI) [chen2023extending] to the low- or high-resolution grid.
  • Figure 3: Attention Scores vs. RoPE relative distance $\Delta$. Mean normalized scores $\kappa(\Delta)$ on the Wan model [wan2025wan] across diffusion steps $t\in\{428,749,922\}$. For each axis (time, height, width), curves are averaged over all attention heads and over RoPE-dominant heads, where RoPE dominance is defined by a head-level RoPE-dominance score ($rds$); heads with $rds>0.085$ are classified as RoPE-dominant. The relative distance $\Delta$ denotes token offsets along the corresponding axis. We observe (i) strong periodicity with a sharp global maximum near $\Delta\approx 0$, (ii) amplification in RoPE-dominant heads, and (iii) stability across timesteps, suggesting a pretrained phase prior.
  • Figure 4: Cross-Resolution Phase-Aligned Attention (CRPA). For each attention call, RoPE indices of keys are rescaled onto the query grid so that equal physical distances yield identical phase increments, eliminating cross-rate aliasing and enabling stable mixed-resolution denoising with arbitrary LR/HR layouts.
  • Figure 5: Boundary Expand-and-Replace. Around LR-HR boundaries, we dilate the masks and bidirectionally exchange upsampled and downsampled latent content within a narrow band, harmonizing textures while adding negligible overhead.
  • ...and 9 more figures
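
The mask-dilation step named in the Figure 5 caption can be sketched as follows. This is only an illustration of how a narrow band on each side of an LR-HR boundary might be located: the helper names `dilate` and `boundary_bands`, the square structuring element, and the `radius` parameter are assumptions. The actual exchange of upsampled and downsampled latents is not shown, since its details are not given here.

```python
import numpy as np

def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1)-wide square, using padded shifts."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary_bands(hr_mask, radius=1):
    """Hypothetical helper: return the narrow bands on each side of the boundary.

    lr_side: low-resolution cells within `radius` of the high-resolution region.
    hr_side: high-resolution cells within `radius` of the low-resolution region.
    """
    hr_mask = np.asarray(hr_mask, dtype=bool)
    lr_side = dilate(hr_mask, radius) & ~hr_mask
    hr_side = dilate(~hr_mask, radius) & hr_mask
    return lr_side, hr_side

# Toy layout: a 2x2 high-resolution patch inside a 6x6 token grid.
hr_mask = np.zeros((6, 6), dtype=bool)
hr_mask[2:4, 2:4] = True
lr_side, hr_side = boundary_bands(hr_mask, radius=1)
```

The two bands are disjoint by construction, so content could be exchanged in both directions without either write overlapping the other.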