Rotary Position Embedding for Vision Transformer

Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun

TL;DR

Rotary Position Embedding (RoPE) is extended from language models to 2D vision transformers to enable extrapolation across resolutions. The authors introduce RoPE-Mixed, a 2D RoPE variant with mixed axis frequencies and learnable parameters, to capture diagonal relative positions; they compare axial RoPE and mixed RoPE in ViT and Swin architectures across ImageNet-1k, COCO, and ADE20k. They demonstrate strong extrapolation performance, outperforming absolute positional embedding (APE) and relative position bias (RPB) in high-resolution inference and dense prediction, with negligible computational overhead. The work provides practical guidelines for applying RoPE to ViTs and releases code and pre-trained models, suggesting RoPE-Mixed as a generally applicable position embedding for vision tasks with improved backbone performance.

Abstract

Rotary Position Embedding (RoPE) performs remarkably well on language models, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE applied to ViTs, using practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., it maintains precision while the image resolution increases at inference. This eventually leads to performance improvements on ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
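
To make the axial versus mixed distinction in the TL;DR concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes the usual complex-number view of RoPE, in which each pair of feature channels is treated as one complex number and rotated by a position-dependent angle. The function names (`rope_axial`, `rope_mixed`, `score`) and all frequency values are hypothetical; in RoPE-Mixed the per-axis frequencies would be learnable parameters, whereas here they are fixed constants.

```python
import cmath

def rope_axial(vec, px, py, thetas):
    """Axial variant (assumed form): even channels rotate with x, odd channels with y."""
    out = []
    for t, z in enumerate(vec):
        theta = thetas[t // 2]            # one frequency shared by an (x, y) channel pair
        pos = px if t % 2 == 0 else py    # each channel sees only one axis
        out.append(z * cmath.exp(1j * theta * pos))
    return out

def rope_mixed(vec, px, py, thetas_x, thetas_y):
    """Mixed variant (assumed form): every channel rotates by theta_x*px + theta_y*py."""
    return [z * cmath.exp(1j * (tx * px + ty * py))
            for z, tx, ty in zip(vec, thetas_x, thetas_y)]

def score(q, k):
    """Attention logit as the real part of the complex inner product q . conj(k)."""
    return sum(a * b.conjugate() for a, b in zip(q, k)).real

# Hypothetical query/key features (4 complex channels = 8 real channels).
q = [1 + 2j, 0.5 - 1j, -0.3 + 0.7j, 2 - 0.1j]
k = [0.2 + 1j, -1 + 0.4j, 0.9 - 0.6j, 0.3 + 0.3j]
tx = [1.0, 0.5, 0.25, 0.125]   # stand-ins for learnable x frequencies
ty = [0.8, 0.4, 0.2, 0.1]      # stand-ins for learnable y frequencies

# Two query/key placements with the same relative offset (dx, dy) = (2, 3).
s_near = score(rope_mixed(q, 3, 5, tx, ty), rope_mixed(k, 1, 2, tx, ty))
s_far = score(rope_mixed(q, 12, 13, tx, ty), rope_mixed(k, 10, 10, tx, ty))

a_near = score(rope_axial(q, 3, 5, [1.0, 0.5]), rope_axial(k, 1, 2, [1.0, 0.5]))
a_far = score(rope_axial(q, 12, 13, [1.0, 0.5]), rope_axial(k, 10, 10, [1.0, 0.5]))
```

Under these assumptions, `s_near` equals `s_far` and `a_near` equals `a_far` up to floating-point error, because the rotation applied to the query and the conjugate rotation applied to the key leave only the relative offset in the logit. In the mixed form each channel's angle combines both axes, which is what lets it respond to diagonal offsets; in the axial form each channel depends on a single axis only.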

Paper Structure

This paper contains 32 sections, 16 equations, 6 figures, and 15 tables.

Figures (6)

  • Figure 1: 2D Fourier reconstruction with RoPE frequencies. We perform a Fast Fourier Transform (FFT) followed by an inverse FFT with only RoPE frequencies to evaluate the representation capabilities of RoPE frequencies.
  • Figure 2: Attention distances of ViT-B for APE/RoPE. We measure the average distance of attention interaction by computing the distance between query-key tokens from attention probabilities. We average the distance across the validation set.
  • Figure 3: Entropy of attention in ViT-B with APE or RoPE. Entropy of attention probability is measured for every self-attention of ViT-B. A high entropy value indicates that a large number of tokens are involved in the attention interaction.
  • Figure 4: Multi-resolution performance of ViTs. We apply two variants of 2D RoPE, RoPE-Axial and RoPE-Mixed, to the ViT architectures. All ViTs are trained on ImageNet-1k [deng2009imagenet] with the 400-epoch training recipe of DeiT-III [touvron2022deit3].
  • Figure 5: Multi-resolution performance of Swin Transformers. We replace RPB in Swin Transformers with 2D RoPE variants: RoPE-Axial and RoPE-Mixed. The Swin Transformers are trained with their 300-epoch training recipe [liu2021swin]. For multi-resolution inference, we change the window size of the window attention.
  • ...and 1 more figures
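
The two attention statistics described in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative reading of the captions, not the paper's evaluation code: it assumes the attention matrix covers patch tokens only (no class token), that tokens are laid out row-major on a grid of width `grid_w`, that distance is Euclidean in units of patches, and that entropy uses the natural logarithm. The function names are hypothetical.

```python
import math

def attention_entropy(probs):
    """Mean Shannon entropy (in nats) of the attention rows, one row per query."""
    total = 0.0
    for row in probs:
        total -= sum(p * math.log(p) for p in row if p > 0)
    return total / len(probs)

def attention_distance(probs, grid_w):
    """Mean expected query-key distance (in patches), weighted by attention probability."""
    total = 0.0
    for qi, row in enumerate(probs):
        qy, qx = divmod(qi, grid_w)          # query position on the patch grid
        for ki, p in enumerate(row):
            ky, kx = divmod(ki, grid_w)      # key position on the patch grid
            total += p * math.hypot(qx - kx, qy - ky)
    return total / len(probs)

# Toy 2x2 patch grid (4 tokens): uniform attention versus purely self-attending rows.
uniform = [[0.25] * 4 for _ in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

uniform_entropy = attention_entropy(uniform)
uniform_distance = attention_distance(uniform, grid_w=2)
identity_entropy = attention_entropy(identity)
identity_distance = attention_distance(identity, grid_w=2)
```

In this toy case the uniform pattern gives the maximum entropy for four tokens, log 4, and a positive mean distance, while the identity pattern gives zero for both. This matches the captions' reading that higher entropy means more tokens take part in the attention interaction. Averaging over heads, layers, and validation images is left out of the sketch.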