Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings

Liang Hou, Cong Liu, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai

TL;DR

Diffusion transformers struggle to generalize to higher resolutions due to positional encoding misalignment between training and inference. The authors propose 2D Randomized Positional Encodings (RPE-2D), which randomly sample 2D patch positions along the horizontal and vertical axes during training, ensuring test-time PEs are within the training distribution and turning high-resolution extrapolation into interpolation. They augment this with random resize-and-crop, micro-conditioning, attention-scale adjustments, and timestep-shift mapping to stabilize high-resolution sampling and preserve image structure. Experimental results on ImageNet show state-of-the-art resolution generalization across multiple extrapolation settings, along with strong low-resolution generation and improved multi-stage training efficiency, while remaining compatible with existing PE families. RPE-2D thus offers a practical, PE-centric path to scalable, multi-resolution diffusion transformers.

Abstract

Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a key obstacle for diffusion transformers in addressing this problem is the mismatch between positional encodings seen at inference and those used during training. Existing strategies, such as positional encoding interpolation, extrapolation, or hybrids of the two, do not fully resolve this mismatch. In this paper, we propose a novel two-dimensional randomized positional encoding scheme, namely RPE-2D, that prioritizes the order of image patches rather than their absolute distances, enabling seamless high- and low-resolution generation without training on multiple resolutions. Concretely, RPE-2D independently samples positions along the horizontal and vertical axes over an expanded range during training, ensuring that the encodings used at inference lie within the training distribution and thereby improving resolution generalization. We further introduce a simple random resize-and-crop augmentation to strengthen order modeling and add micro-conditioning to indicate the applied cropping pattern. On the ImageNet dataset, RPE-2D achieves state-of-the-art resolution generalization performance, outperforming competitive methods when trained at $256^2$ and evaluated at $384^2$ and $512^2$, and when trained at $512^2$ and evaluated at $768^2$ and $1024^2$. RPE-2D also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.

Paper Structure

This paper contains 25 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of RPE-2D for training and inference. During training (left), row and column indices are randomly sampled without replacement from the maximal grid $H \times W$ and sorted to form a set of 2D positions matching the training resolution. During inference (right), a deterministic, approximately equidistant grid matching the inference resolution is used.
  • Figure 2: Qualitative results of RPE-2D against different positional encoding extrapolation methods at different resolutions.
  • Figure 3: Generated images at different resolutions, including $128\times 128$, $256\times 256$, $512\times 512$, $768\times 768$, and $1024\times 1024$, where the model is trained only at resolutions of $256\times 256$ and $512\times 512$.
  • Figure 4: Training loss and FID curves of RoPE and RPE-2D.
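
The position sampling described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the maximal grid size of 64, and the patch-grid sizes are hypothetical, and the rounded `linspace` is only one plausible way to build the "approximately equidistant" inference grid that the caption mentions.

```python
import numpy as np

def sample_training_positions(train_size, max_size, rng):
    """Draw `train_size` distinct indices from range(max_size) and sort them.

    Sampling without replacement followed by sorting preserves patch order
    while randomizing the absolute distances between neighbours.
    """
    idx = rng.choice(max_size, size=train_size, replace=False)
    return np.sort(idx)

def inference_positions(infer_size, max_size):
    """Deterministic, approximately equidistant indices spanning range(max_size).

    Assumption: rounded linspace; the exact rule is not given in this summary.
    """
    return np.round(np.linspace(0, max_size - 1, infer_size)).astype(int)

def grid_2d(rows, cols):
    """Combine per-axis indices into an (h, w, 2) array of (row, col) positions."""
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr, cc], axis=-1)

rng = np.random.default_rng(0)
H = W = 64   # maximal grid (hypothetical value)
h = w = 16   # patch grid at the training resolution (hypothetical value)

# Training: rows and columns are sampled independently, then sorted.
train_pos = grid_2d(sample_training_positions(h, H, rng),
                    sample_training_positions(w, W, rng))

# Inference at a larger patch grid: positions still lie inside [0, H) x [0, W).
infer_pos = grid_2d(inference_positions(32, H), inference_positions(32, W))
```

Because every inference index falls inside the range sampled during training, the larger grid uses positions the model has already encountered, which is the sense in which extrapolation becomes interpolation.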