Win-Win: Training High-Resolution Vision Transformers from Two Windows

Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel

TL;DR

Win-Win tackles the cost of training high-resolution vision transformers by masking most input tokens and training on two non-overlapping windows, enabling the model to learn both local and global token interactions. It uses RoPE-based relative positional embeddings and convolutional heads, and supports direct full-resolution inference at test time without tiling. Across semantic segmentation and optical flow benchmarks, Win-Win substantially reduces training time (about 3–4x) and memory usage while achieving competitive or state-of-the-art performance, including on Full-HD Spring with faster inference than tiling-based methods. The approach offers a general, scalable path to efficient high-resolution ViTs applicable to both monocular and binocular dense prediction tasks.

Abstract

Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers. The key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local interactions between tokens inside each window, and global interactions between tokens from different windows. As a result, the model can directly process the high-resolution input at test time without any special trick. We show that this strategy is effective when using relative positional embedding such as rotary embeddings. It is 4 times faster to train than a full-resolution network, and it is straightforward to use at test time compared to existing approaches. We apply this strategy to three dense prediction tasks with high-resolution data. First, we show on the task of semantic segmentation that a simple setting with 2 windows performs best, hence the name of our method: Win-Win. Second, we confirm this result on the task of monocular depth prediction. Third, we further extend it to the binocular task of optical flow, reaching state-of-the-art performance on the Spring benchmark that contains Full-HD images with an order of magnitude faster inference than the best competitor.
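The masking principle described above can be illustrated with a minimal sketch. It samples two non-overlapping square windows on a grid of patch tokens and lists the tokens that would be kept during training. The grid size, window size, rejection-sampling loop, and the function names `sample_two_windows` and `kept_token_indices` are all assumptions made for illustration; they are not taken from the paper or from the authors' code.

```python
import random


def sample_two_windows(grid_h, grid_w, win, rng, max_tries=1000):
    """Sample two disjoint win x win windows on a grid_h x grid_w token grid.

    Returns a list of two (top, left) corners. Rejection sampling is an
    illustrative choice, not necessarily how the authors sample windows.
    """
    if win > grid_h or win > grid_w:
        raise ValueError("window larger than token grid")
    for _ in range(max_tries):
        corners = [
            (rng.randint(0, grid_h - win), rng.randint(0, grid_w - win))
            for _ in range(2)
        ]
        (t0, l0), (t1, l1) = corners
        # Two equal-size axis-aligned squares are disjoint if and only if
        # they are separated by at least `win` along rows or along columns.
        if abs(t0 - t1) >= win or abs(l0 - l1) >= win:
            return corners
    raise RuntimeError("could not place two disjoint windows")


def kept_token_indices(grid_w, win, corners):
    """Flat row-major indices of the tokens that fall inside the windows."""
    kept = []
    for top, left in corners:
        for r in range(top, top + win):
            for c in range(left, left + win):
                kept.append(r * grid_w + c)
    return kept


# Hypothetical sizes: a 32 x 48 token grid and 8 x 8 windows.
GRID_H, GRID_W, WIN = 32, 48, 8
rng = random.Random(0)
corners = sample_two_windows(GRID_H, GRID_W, WIN, rng)
kept = kept_token_indices(GRID_W, WIN, corners)

# Fraction of tokens that survive masking in this toy setting.
keep_ratio = len(kept) / (GRID_H * GRID_W)
print(f"kept {len(kept)} of {GRID_H * GRID_W} tokens ({keep_ratio:.1%})")
```

In this toy setting 128 of 1536 tokens are kept. Tokens inside one window still attend to each other (local interactions) and to tokens of the other window (global interactions), which is the property the abstract relies on. The ratio here only illustrates why masking reduces cost; it is not the configuration behind the training-time figures reported in the paper.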

Paper Structure (42 sections, 1 equation, 8 figures, 8 tables)

This paper contains 42 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Validation performance vs. training time on semantic segmentation (left) and optical flow (middle). We compare our two-window training (Win-Win) to standard full-resolution training and to a sparsification of the attention following ViT-Det [vitdet]. Memory usage is indicated in parentheses in the legend. Compared to full-resolution training, Win-Win reduces the training time by a factor of 3 to 4 and halves the memory usage while reaching similar performance. Training and inference times on optical flow, for Win-Win vs. other strategies (right). ViT+Tiling corresponds to a setup similar to CroCo-Flow [crocostereo], where the model is trained on random crops but requires a tiling strategy at inference. While Win-Win is as fast to train as the latter, it can directly process full-resolution inputs at test time.
  • Figure 2: Overview of Win-Win, our approach for high-resolution training of ViTs. We show that certain masking configurations can generalize to full resolution at inference time. Specifically, using 2 random squares is enough, since this models both local interactions inside each square and global interactions with patches from the other square. This also speeds up training and considerably decreases memory usage, since most image patches are discarded. Our framework is general and applies to binocular tasks as well, e.g., optical flow (see Figure 3).
  • Figure 3: Overview of Win-Win for the task of optical flow estimation. Masking is performed asymmetrically so that selected windows in the second frame are more likely to correspond to windows randomly selected in the first frame. The rest of the framework remains identical.
  • Figure 4: Example result on Spring test set. For the error maps, blue and red denote low and high errors, respectively. Block artefacts are clearly visible for CroCo-Flow but notably absent for Win-Win.
  • Figure 5: Robustness to test resolution. We compare the performance of models trained with Win-Win, on crops, and at full resolution when varying the test resolution. Some of the performance decrease at lower resolutions can be explained by the smaller context available; for this reason, the dashed line shows the performance when training at the target resolution.
  • ...and 3 more figures