RAViT: Resolution-Adaptive Vision Transformer

Martial Guidez, Stefan Duffner, Christophe Garcia

TL;DR

RAViT is a novel framework for image classification based on a multi-branch network that operates on several copies of the same image at different resolutions, reducing the computational cost while preserving the overall accuracy.

Abstract

Vision transformers have recently made a breakthrough in computer vision, showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT, based on a multi-branch network that operates on several copies of the same image at different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early-exit mechanism that makes our model adaptive and allows the appropriate trade-off between accuracy and computational cost to be chosen at run-time. For example, in a two-branch architecture, the original image is first resized to reduce its resolution; a first transformer then performs a prediction on the resized image, and the resulting prediction is reused together with the original-size image to perform a final prediction with a second transformer, using less computation than a classical Vision Transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet. We obtained accuracy equivalent to that of the classical Vision Transformer model with only around 70% of the FLOPs.
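The two-branch flow with early exit described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exit criterion (maximum softmax probability compared to a threshold), the way the earlier prediction is passed on (as raw logits), and the names `downsample`, `cascade_predict`, `coarse_branch`, and `fine_branch` are all hypothetical. The toy branches are plain functions standing in for the transformers.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values
# A branch maps (image view, previous branch logits or None) to class logits.
Branch = Callable[[Image, Optional[Sequence[float]]], List[float]]


def downsample(image: Image, factor: int) -> Image:
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w = len(image), len(image[0])
    area = factor * factor
    return [
        [
            sum(image[y + dy][x + dx] for dy in range(factor) for dx in range(factor)) / area
            for x in range(0, w - factor + 1, factor)
        ]
        for y in range(0, h - factor + 1, factor)
    ]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(
    image: Image,
    branches: Sequence[Branch],
    factors: Sequence[int],
    threshold: float,
) -> Tuple[int, int]:
    """Run branches from lowest to highest resolution.

    Returns (predicted class, index of the branch that produced it).
    Assumption: a branch exits early when its top softmax probability
    reaches `threshold`; the last branch always answers.
    """
    previous: Optional[Sequence[float]] = None
    for index, (branch, factor) in enumerate(zip(branches, factors)):
        view = downsample(image, factor) if factor > 1 else image
        logits = branch(view, previous)
        probs = softmax(logits)
        confidence = max(probs)
        is_last = index == len(branches) - 1
        if is_last or confidence >= threshold:
            return probs.index(confidence), index
        previous = logits  # assumption: later branches reuse the earlier logits
    raise ValueError("at least one branch is required")


def _mean(view: Image) -> float:
    return sum(map(sum, view)) / (len(view) * len(view[0]))


def coarse_branch(view: Image, previous: Optional[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for the low-resolution transformer."""
    return [2.0 * _mean(view), 0.0]


def fine_branch(view: Image, previous: Optional[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for the full-resolution transformer; refines earlier logits."""
    assert previous is not None
    return [previous[0] - 3.0 * _mean(view), previous[1]]


image = [[1.0] * 4 for _ in range(4)]
branches = [coarse_branch, fine_branch]
factors = [2, 1]  # first branch sees a half-resolution copy, second the original

easy = cascade_predict(image, branches, factors, threshold=0.8)
hard = cascade_predict(image, branches, factors, threshold=0.95)
```

With the lower threshold the first branch is confident enough to answer on its own, so the full-resolution branch is never run; with the higher threshold the second branch is evaluated and produces the final prediction. Raising or lowering `threshold` at run-time is what trades accuracy against computation in this sketch.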

Paper Structure (11 sections, 4 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Example of a RAViT architecture on a CIFAR-10 image.
  • Figure 2: Framework of our RAViT architecture with 3 branches. Note that the number of branches can be increased or decreased without radically changing the architecture.
  • Figure 3: Performance analysis of different RAViT models with 2 branches used on CIFAR-10. (a) Accuracy for different layer numbers on CIFAR-10. (b) Number of FLOPs for different layer numbers on CIFAR-10.
  • Figure 4: Tiny ImageNet results. (a) Test accuracy in relation to FLOPs for different architectures. (b) Accuracy (written on top of each column) and exit distributions for different values of the early-exit threshold applied to a 2-0-3 network.
  • Figure 5: Performance analysis of different RAViT models trained on Tiny ImageNet with 3 branches and 3 layers on the third one. To be compared with a 4-layer classical ViT with 0.41 accuracy and 223.3 GFLOPs. (a) Accuracy for different models with 3 branches and 3 layers on the third branch. (b) GFLOPs for different models with 3 branches and 3 layers on the third branch.
  • ...and 1 more figure