Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings
Liang Hou, Cong Liu, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
TL;DR
Diffusion transformers struggle to generalize to higher resolutions due to positional encoding misalignment between training and inference. The authors propose 2D Randomized Positional Encodings (RPE-2D), which randomly sample 2D patch positions along the horizontal and vertical axes during training, ensuring test-time PEs are within the training distribution and turning high-resolution extrapolation into interpolation. They augment this with random resize-and-crop, micro-conditioning, attention-scale adjustments, and timestep-shift mapping to stabilize high-resolution sampling and preserve image structure. Experimental results on ImageNet show state-of-the-art resolution generalization across multiple extrapolation settings, along with strong low-resolution generation and improved multi-stage training efficiency, while remaining compatible with existing PE families. RPE-2D thus offers a practical, PE-centric path to scalable, multi-resolution diffusion transformers.
Abstract
Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a key obstacle for diffusion transformers in addressing this problem is the mismatch between positional encodings seen at inference and those used during training. Existing strategies, such as positional encoding interpolation, extrapolation, or hybrids of the two, do not fully resolve this mismatch. In this paper, we propose a novel two-dimensional randomized positional encoding scheme, namely RPE-2D, that prioritizes the order of image patches rather than their absolute distances, enabling seamless high- and low-resolution generation without training on multiple resolutions. Concretely, RPE-2D independently samples positions along the horizontal and vertical axes over an expanded range during training, ensuring that the encodings used at inference lie within the training distribution and thereby improving resolution generalization. We further introduce a simple random resize-and-crop augmentation to strengthen order modeling and add micro-conditioning to indicate the applied cropping pattern. On the ImageNet dataset, RPE-2D achieves state-of-the-art resolution generalization performance, outperforming competitive methods when trained at $256^2$ and evaluated at $384^2$ and $512^2$, and when trained at $512^2$ and evaluated at $768^2$ and $1024^2$. RPE-2D also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
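The per-axis sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform sampling without replacement, and the example sizes (a 16 x 16 patch grid, an expanded range of 64) are all assumptions made for illustration. The sketch only shows the core idea that row and column indices are drawn independently from a larger range and sorted, so patch order is preserved while absolute distances vary.

```python
import numpy as np


def sample_axis_positions(num_patches, max_len, rng):
    """Draw `num_patches` distinct integer positions from [0, max_len), sorted.

    Sorting preserves the left-to-right (or top-to-bottom) order of patches
    even though the gaps between neighbouring positions are random.
    Uniform sampling without replacement is an assumption of this sketch.
    """
    if num_patches > max_len:
        raise ValueError("expanded range must be at least the number of patches")
    positions = rng.choice(max_len, size=num_patches, replace=False)
    return np.sort(positions)


def sample_rpe_2d_grid(height, width, max_len, rng):
    """Return a (height, width, 2) array of (row, col) position indices.

    Rows and columns are sampled independently, as the abstract describes.
    """
    rows = sample_axis_positions(height, max_len, rng)
    cols = sample_axis_positions(width, max_len, rng)
    row_grid, col_grid = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([row_grid, col_grid], axis=-1)


# Hypothetical sizes: 16 x 16 training patches, positions drawn from [0, 64).
rng = np.random.default_rng(0)
grid = sample_rpe_2d_grid(16, 16, 64, rng)
print(grid.shape)  # (16, 16, 2)
```

Under these assumptions, a larger patch grid at inference (up to 64 per axis here) uses only position indices that could already have appeared during training, which is the sense in which extrapolation becomes interpolation. The resulting indices would then be fed to whichever positional encoding family the model uses.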
