MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution

Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu

TL;DR

This paper addresses the challenge of running Vision Transformers on images of variable resolution, a common scenario in the wild. It introduces Multi-Scale Patch Embedding (MSPE), a lightweight replacement for the patch embedding layer that uses multiple adaptive kernels and pseudo-inverse based kernel resizing to handle arbitrary input sizes without resizing the image. By training only the patch-embedding parameters on mixed-resolution data, MSPE achieves superior or competitive performance across image classification, semantic segmentation, and object detection while keeping training cost low. The results suggest that adapting the embedding stage alone can make pre-trained ViT models robust to resolution changes in real-world deployments.

Abstract

Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem was overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution, such as 224x224, for efficiency during training and inference. However, uniform input size conflicts with real-world scenarios where images naturally vary in resolution. Modifying the preset resolution of a model may severely degrade the performance. In this work, we propose to enhance the model adaptability to resolution variation by optimizing the patch embedding. The proposed method, called Multi-Scale Patch Embedding (MSPE), substitutes the standard patch embedding with multiple variable-sized patch kernels and selects the best parameters for different resolutions, eliminating the need to resize the original image. Our method does not require high-cost training or modifications to other parts, making it easy to apply to most ViT models. Experiments in image classification, segmentation, and detection tasks demonstrate the effectiveness of MSPE, yielding superior performance on low-resolution inputs and performing comparably on high-resolution inputs with existing methods.
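
The pseudo-inverse based kernel resizing mentioned in the TL;DR can be illustrated with a minimal sketch. It follows the general pseudo-inverse resize idea known from FlexiViT: if upsampling a patch is a linear map B, the resized kernel is chosen as pinv(B^T) w so that the resized kernel applied to the resized patch gives the same response as the original kernel on the original patch. The function names (linear_resize_matrix, resize_patch, pi_resize_kernel), the hand-written linear interpolation, and the single-channel square kernel are assumptions made for illustration; they are not taken from the paper's code, and the MSPE implementation may differ.

```python
import numpy as np


def linear_resize_matrix(n_in: int, n_out: int) -> np.ndarray:
    """Return the (n_out, n_in) matrix of a 1-D linear interpolation (illustrative)."""
    A = np.zeros((n_out, n_in))
    for j in range(n_out):
        # Map the centre of output sample j to input coordinates, then clamp.
        s = (j + 0.5) * n_in / n_out - 0.5
        s = min(max(s, 0.0), n_in - 1.0)
        i0 = int(np.floor(s))
        i1 = min(i0 + 1, n_in - 1)
        t = s - i0
        A[j, i0] += 1.0 - t
        A[j, i1] += t
    return A


def resize_patch(x: np.ndarray, p_out: int) -> np.ndarray:
    """Resize a square patch with the separable linear map above."""
    A = linear_resize_matrix(x.shape[0], p_out)
    return A @ x @ A.T


def pi_resize_kernel(w: np.ndarray, p_out: int) -> np.ndarray:
    """Resize a square single-channel kernel via the pseudo-inverse of the resize map."""
    p_in = w.shape[0]
    A = linear_resize_matrix(p_in, p_out)
    # For row-major flattening, vec(A X A^T) = kron(A, A) vec(X).
    B = np.kron(A, A)
    w_hat = np.linalg.pinv(B.T) @ w.reshape(-1)
    return w_hat.reshape(p_out, p_out)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))   # stand-in for one channel of a patch kernel
x = rng.standard_normal((4, 4))   # stand-in for one image patch
w_hat = pi_resize_kernel(w, 8)

# Response of the original kernel on the original patch versus the
# resized kernel on the upsampled patch.
print(np.sum(x * w), np.sum(resize_patch(x, 8) * w_hat))
```

When the kernel is enlarged (as here, 4 to 8), the resize matrix has full column rank, so the two printed responses agree up to floating-point error. When the kernel is shrunk, the match is only a least-squares approximation.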

Paper Structure (30 sections, 10 equations, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: MSPE results on ImageNet-1K. We load a ViT-B model pre-trained on ImageNet-21K from [steiner2022how] and evaluate it with (a) height equal to width, ranging from 28$\times$28 to 896$\times$896, and (b) height fixed at 128 and width ranging from 28 to 896. Vanilla ViT performance drops as the size or aspect ratio changes; FlexiViT [beyer2023flexivit] significantly improves performance, and our method surpasses FlexiViT.
  • Figure 2: Similarity in patch embeddings does not guarantee optimal performance (a). We confirm this by evaluating the accuracy and cosine similarity of: (b) patch embeddings $\{\bm{z}_i\}_{i=1}^N$ from 56$\times$56 and 224$\times$224 images, and (c) class tokens $\bm{z}_{\text{cls}}$ from 56$\times$56 and 224$\times$224 images.
  • Figure 3: Illustration of the ViT model [dosovitskiyimage, touvron2021training] with MSPE. MSPE replaces only the patch embedding layer of the vanilla model, allowing well-trained ViT models to be applied directly to any size and aspect ratio. In our method, the patch embedding layer has several variable-sized kernels. The Transformer encoder is shared and frozen.
  • Figure 4: ImageNet-1K Top-1 accuracy curves with heights fixed at 192, 256, and 384. The results show that MSPE can be applied directly across varying aspect ratios and improves performance.
  • Figure 5: Comparison of MSPE, Vanilla, and NaViT. Only NaViT was pre-trained on the JFT dataset; baseline results come from [dehghani2024patch].
  • ...and 6 more figures