Context-Aware Token Selection and Packing for Enhanced Vision Transformer

Tianyi Zhang, Baoxin Li, Jae-sun Seo, Yu Cao

TL;DR

A novel algorithm, Select and Pack Attention (SPA), dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference.

Abstract

In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes both informative and non-informative tokens, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.
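The select-then-pack idea described above can be illustrated with a minimal, framework-free sketch. Everything in it is an assumption made for illustration: the gate scores are hard-coded rather than produced by a learned gating layer, the threshold of 0.5 is arbitrary, and the greedy concatenation used in the hypothetical `pack_tokens` helper is only one possible packing strategy, not necessarily the one the paper uses.

```python
from typing import List, Optional, Tuple

# A packed slot holds (image_id, token_index), or None when it is padding.
Slot = Optional[Tuple[int, int]]


def select_tokens(gate_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of tokens whose gate score exceeds the threshold."""
    return [i for i, score in enumerate(gate_scores) if score > threshold]


def pack_tokens(
    batch_scores: List[List[float]], pack_len: int, threshold: float = 0.5
) -> List[List[Slot]]:
    """Concatenate the selected tokens of all images into packs of pack_len slots.

    Assumption: tokens are packed greedily in image order, and only the
    last pack is padded with None.
    """
    if pack_len < 1:
        raise ValueError("pack_len must be at least 1")
    packs: List[List[Slot]] = [[]]
    for image_id, gate_scores in enumerate(batch_scores):
        for token_index in select_tokens(gate_scores, threshold):
            if len(packs[-1]) == pack_len:
                packs.append([])  # current pack is full, start a new one
            packs[-1].append((image_id, token_index))
    # Pad the final pack so every pack has the same length.
    packs[-1].extend([None] * (pack_len - len(packs[-1])))
    return packs


# Hypothetical gate scores for a batch of three images with four tokens each.
batch_scores = [
    [0.9, 0.1, 0.8, 0.2],
    [0.7, 0.6, 0.3, 0.95],
    [0.05, 0.4, 0.55, 0.1],
]
packs = pack_tokens(batch_scores, pack_len=4)
for pack in packs:
    print(pack)
```

In this toy batch, 6 of the 12 tokens are selected, and they occupy two packs of four slots with two padding slots in total. Each image contributes a different number of tokens, which is the property that lets a variable token count coexist with fixed-shape batches.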

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Previous sparse attention methods either reduce computation only during the inference stage or require padding the length of selected tokens to the maximum within a batch, which inevitably introduces background tokens. This leads to reduced efficiency and worse accuracy compared to our SPA.
  • Figure 2: Overall architecture of Select and Packing Transformer (SPT). Like common backbone networks, the hierarchical structure generates features at multiple scales. The SPA blocks in the last two stages improve both efficiency and accuracy by disregarding uninformative tokens.
  • Figure 3: (a) Our SPA computes attention only for informative tokens. (b) Our SnP block selects informative tokens under multi-scale supervision and packs the selected tokens for batch training and inference. The packed tokens attend only to tokens from the same image.
  • Figure 4: Under ground truth (GT) supervision, attending to only informative tokens can achieve better performance and efficiency.
  • Figure 5: We overlay the sum of the selection masks generated by all SPA blocks on the original image. Warm colors denote tokens that are selected frequently, while cold colors denote tokens that are pruned before the attention computation. With the supervision of multi-scale selection labels, the selection process becomes significantly more accurate.
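The Figure 3 caption notes that packed tokens attend only to tokens from the same image. One way to picture this constraint is a per-pack boolean mask applied before the softmax. The sketch below is an illustration under stated assumptions, not the paper's implementation: the helper names `same_image_mask` and `masked_softmax` are hypothetical, padding slots are marked with `None`, and the attention scores are made-up numbers.

```python
import math
from typing import List, Optional


def same_image_mask(image_ids: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when slot i may attend to slot j.

    Two slots may attend to each other only if they come from the same
    image; padding slots (None) neither attend nor are attended to.
    """
    return [
        [a is not None and a == b for b in image_ids]
        for a in image_ids
    ]


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)  # e.g. a padding row attends to nothing
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# One pack holding two tokens from image 0, two from image 1, and one padding slot.
image_ids = [0, 0, 1, 1, None]
mask = same_image_mask(image_ids)

# Hypothetical raw attention scores for the first slot against all five slots.
weights = masked_softmax([1.0, 1.0, 5.0, 5.0, 0.0], mask[0])
print(weights)
```

Although the image 1 slots carry larger raw scores, they receive zero weight for a query from image 0, so packing tokens from several images together does not mix information across images.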