Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

Simin Huo, Ning Li

TL;DR

Iwin Transformer presents a position-embedding-free vision backbone that achieves global information exchange within a single block by marrying interleaved window attention with depthwise separable convolution. The method preserves efficiency through a four-stage hierarchical design and cross-resolution fine-tuning, outperforming or matching Swin in image classification and video tasks, while showing competitive segmentation results and notable gains in generation-oriented uses. Ablation and diverse-task experiments validate the core design choices and demonstrate practical benefits for high-resolution workloads and potential extensions to generation and 3D data. While COCO object detection shows a task-specific gap to Swin, the overall approach offers a versatile, scalable alternative to standard self-attention, with promising implications for diffusion-based generation and large-scale language model adaptations.

Abstract

We introduce Iwin Transformer, a novel position-embedding-free hierarchical vision transformer, which can be fine-tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top-1 accuracy on ImageNet-1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self-attention module in class-conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin-Transformer.

Paper Structure

This paper contains 45 sections, 3 theorems, 15 equations, 12 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

In interleaved window attention, tokens at positions $(i_1,j_1)$ and $(i_2,j_2)$ are in the same attention window if and only if:

Figures (12)

  • Figure 1: Diagram of the proposed pattern. In (a), token 1 in the CNN can only interact with the nearby token 3 and cannot reach the distant token 7, so the CNN is restricted to capturing local features. In contrast, token 1 in the ViT can be associated with any token, which captures global features but at quadratic complexity. In the third, proposed CNN+Transformer pattern, token 1 first connects with token 5 at a short distance through attention, and token 5 is related to token 7 via convolution. In this way, tokens 1 and 7 communicate indirectly despite being far apart. (b) shows an intuitive top view of the proposed CNN+Transformer pattern.
  • Figure 2: Illustration of Iwin attention. In the left image, the tokens marked by green triangles and red stars are connected through convolutions in the original image. In the right image, all green-triangle tokens are assigned to the same window through the RTR (Reshape-Transpose-Reshape) operation and window segmentation, and window attention is then executed to establish connections among them. The red-star tokens are grouped and processed in the same way. As a result, global convolution and window attention on the interleaved sequence work together to approximate standard global attention, meaning that connections are established between any pair of tokens in the original image.
  • Figure 3: Diagram of the Iwin Block. (a) S1 is a parallel structure in which the convolution and attention results are combined directly; it is the fastest variant and the one implemented in this study. (b) S2 is a parallel scheme with independent convolution and attention connections to the input; it exhibits the poorest performance. (c) S3 is a serial configuration in which the attention input receives the convolution output; it performs slightly better than S1 but requires one more layer normalization, which increases computation.
  • Figure 4: Heatmap visualization. The left column shows the input images, and the subsequent columns show results from the native ViT, PVTv2, Swin, and Iwin (our method). The results demonstrate that Iwin effectively concentrates activation on target objects.
  • Figure 5: Visualization of object detection on COCO 2017. The leftmost column shows the input images. From left to right, the remaining columns show the results generated by PVTv2-based, Swin-based, and Iwin-based Mask R-CNN.
  • ...and 7 more figures
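
The window grouping described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the helper name `interleaved_windows`, the axis ordering, and the assumption that a window gathers the tokens sharing the same row and column index modulo the number of windows per axis are all ours, chosen only to show how a reshape, a transpose, and a second reshape can place strided tokens in one window.

```python
import numpy as np


def interleaved_windows(x, m):
    """Group an (H, W) token grid into windows of m x m strided tokens.

    Hypothetical helper (not from the paper's code base). Assumes H and W
    are divisible by m. Tokens that share the same (row % gh, col % gw)
    end up in the same window, where gh = H // m and gw = W // m.
    """
    h, w = x.shape
    assert h % m == 0 and w % m == 0, "grid must be divisible by window size"
    gh, gw = h // m, w // m  # number of windows per axis, also the token stride

    # Reshape: split each axis into (position inside window, window index).
    x = x.reshape(m, gh, m, gw)
    # Transpose: move the two window indices to the front.
    x = x.transpose(1, 3, 0, 2)
    # Reshape: one m x m block per window.
    return x.reshape(gh * gw, m, m)


# Toy 4 x 4 grid whose entries are the token indices 0..15.
grid = np.arange(16).reshape(4, 4)
windows = interleaved_windows(grid, 2)

# Window 0 holds tokens 0, 2, 8 and 10: two steps apart on each axis.
print(windows[0])
```

In this toy layout, neighboring tokens such as 0 and 1 fall into different windows, which is why the caption pairs the window attention with convolution over the original image to link nearby tokens.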

Theorems & Definitions (3)

  • Lemma 1: Modular Property of Interleaved Window Attention
  • Lemma 2: Locality of Depthwise Separable Convolution
  • Theorem 3: Global Information Exchange Condition