LookupViT: Compressing visual information to a limited number of tokens

Rajat Koner, Gagan Jain, Prateek Jain, Volker Tresp, Sujoy Paul

TL;DR

LookupViT introduces a two-stream token framework that compresses visual data into $M$ compressed tokens while maintaining a larger set of $N$ lookup tokens, with Multi-Head Bidirectional Cross-Attention (MHBC) exchanging information between the two sets. By restricting heavy computation to the compressed tokens and sharing context through MHBC, it reduces FLOPs substantially (often by more than $2\times$) with comparable or improved accuracy across image classification, video classification, and image captioning. The approach supports multi-resolution tokenization, so a single trained model can offer multiple compute-performance trade-offs, and it improves robustness on the ImageNet corruption and distribution-shift benchmarks (ImageNet-C, R, A, O). These properties make LookupViT a flexible, generalizable backbone for resource-constrained vision applications, with potential extensions to dense prediction tasks and larger model scales.

Abstract

Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, that aims to exploit this information sparsity to reduce ViT inference cost. LookupViT provides a novel general purpose vision transformer block that operates by compressing information from higher resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is enabled through a bidirectional cross-attention mechanism. The approach offers multiple advantages - (a) easy to implement on standard ML accelerators (GPUs/TPUs) via standard high-level operators, (b) applicable to standard ViT and its variants, thus generalizes to various tasks, (c) can handle different tokenization and attention approaches. LookupViT also offers flexibility for the compressed tokens, enabling performance-computation trade-offs in a single trained model. We show LookupViT's effectiveness on multiple domains - (a) for image-classification (ImageNet-1K and ImageNet-21K), (b) video classification (Kinetics400 and Something-Something V2), (c) image captioning (COCO-Captions) with a frozen encoder. LookupViT provides $2\times$ reduction in FLOPs while upholding or improving accuracy across these domains. In addition, LookupViT also demonstrates out-of-the-box robustness and generalization on image classification (ImageNet-C,R,A,O), improving by up to $4\%$ over ViT.
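
The quadratic cost mentioned in the abstract can be illustrated with a rough count of multiply-accumulates in the attention matrix products alone. This is a back-of-the-envelope sketch, not the paper's FLOPs accounting: it ignores projections, MLPs, and normalization, and the values N = 196 lookup tokens (a 14×14 patch grid), M = 25 compressed tokens (a 5×5 grid), and D = 768 are illustrative assumptions. The helper name `attention_macs` is hypothetical.

```python
def attention_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus the attention-weighted sum of V."""
    return 2 * num_queries * num_keys * dim


# Illustrative sizes (assumptions, not taken from the paper's tables).
N, M, D = 196, 25, 768

# Vanilla ViT layer: self-attention over all N tokens, quadratic in N.
vit_macs = attention_macs(N, N, D)

# Two-stream layer: self-attention over the M compressed tokens, plus one
# cross-attention in each direction between the M and N tokens. This is an
# upper bound if the second direction reuses work from the first.
two_stream_macs = (
    attention_macs(M, M, D)    # compressed self-attention, quadratic in M
    + attention_macs(M, N, D)  # compressed tokens read from lookup tokens
    + attention_macs(N, M, D)  # lookup tokens read from compressed tokens
)

print(f"ViT attention MACs:        {vit_macs:,}")
print(f"two-stream attention MACs: {two_stream_macs:,}")
print(f"ratio: {vit_macs / two_stream_macs:.2f}x")
```

With M fixed, the two-stream count grows linearly in N rather than quadratically, which is the property the abstract relies on. The whole-model savings reported in the paper also depend on the other layers, so the ratio printed here should not be read as the model-level FLOPs reduction.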

Paper Structure (19 sections, 4 equations, 7 figures, 9 tables, 4 algorithms)

Figures (7)

  • Figure 1: (a) Cross-attention maps between compressed and lookup tokens, emphasizing LookupViT's ability to extract relevant information from lookup tokens as needed for classification. (b) LookupViT vs. ViT as image resolution scales. The points on each curve correspond to different compressed token sizes ($3\times3$, $5\times5$, $7\times7$, $10\times10$). LookupViT scales efficiently relative to ViT.
  • Figure 2: Bidirectional information flow in LookupViT block. LookupViT restricts the heavy computation to the compressed tokens, while extracting information from the lookup tokens. The lookup tokens then update themselves by reusing the information exchange computation.
  • Figure 3: LookupViT Architecture: The LookupViT block is stacked multiple times similar to vanilla ViT. Each LookupViT block has two parallel computation streams for the two different types of tokens. Heavy computation happens on a fixed smaller number of compressed tokens, while light computation happens on the much higher number of lookup tokens. There is an asynchronous information exchange between the two token sets using the Multi-Head Bi-Directional Cross Attention (MHBC) block.
  • Figure 4: (a) Density of normalized feature distance for severity=5 over all corruptions. (b) Mean normalized feature distance over all corruptions for different severity.
  • Figure 5: (a) Video classification on K400 with different spatio-temporal compressed tokens for LookupViViT (LViViT). Color denotes the number of temporal tokens, and the points on each curve correspond to an increasing number of spatial tokens. (b) Training a single model ("Multi-Res") that can handle different numbers of compressed tokens, offering a compute-performance trade-off with the same parameter space. The other models are trained individually but evaluated at all resolutions.
  • ...and 2 more figures
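
The information flow described in the captions for Figures 2 and 3 can be sketched in NumPy as follows. This is a simplified, single-head illustration under stated assumptions, not the authors' implementation: layer normalization, the MLPs on both streams, and multiple heads are omitted, and the step where lookup tokens "reuse the information exchange computation" is interpreted here as transposing the cross-attention logits and re-normalizing them. The function name `lookup_block_sketch` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lookup_block_sketch(compressed, lookup, params):
    """One simplified two-stream block.

    compressed: (M, D) compressed tokens; lookup: (N, D) lookup tokens.
    params: dict of (D, D) projection matrices (hypothetical names).
    """
    d = compressed.shape[-1]

    # 1) Compressed tokens gather information from the lookup tokens.
    q = compressed @ params["cross_q"]
    k = lookup @ params["cross_k"]
    v = lookup @ params["cross_v"]
    logits = q @ k.T / np.sqrt(d)                      # (M, N)
    compressed = compressed + softmax(logits, axis=-1) @ v

    # 2) Heavy computation: self-attention over the M compressed tokens only.
    qs = compressed @ params["self_q"]
    ks = compressed @ params["self_k"]
    vs = compressed @ params["self_v"]
    compressed = compressed + softmax(qs @ ks.T / np.sqrt(d), axis=-1) @ vs

    # 3) Lookup tokens update by reusing the (M, N) logits from step 1,
    #    transposed and re-normalized over the compressed tokens (assumption).
    back_attn = softmax(logits.T, axis=-1)             # (N, M)
    lookup = lookup + back_attn @ (compressed @ params["back_v"])
    return compressed, lookup


# Usage with random weights; the sizes are illustrative assumptions.
rng = np.random.default_rng(0)
M, N, D = 25, 196, 64
names = ("cross_q", "cross_k", "cross_v", "self_q", "self_k", "self_v", "back_v")
params = {name: rng.normal(scale=D ** -0.5, size=(D, D)) for name in names}
compressed_out, lookup_out = lookup_block_sketch(
    rng.normal(size=(M, D)), rng.normal(size=(N, D)), params
)
print(compressed_out.shape, lookup_out.shape)  # (25, 64) (196, 64)
```

In this sketch the only attention matrices ever formed are of size M×N and M×M, so no step is quadratic in the number of lookup tokens N.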