TORE: Token Recycling in Vision Transformers for Efficient Active Visual Exploration

Jan Olszewski; Dawid Rymarczyk; Piotr Wójcik; Mateusz Pach; Bartosz Zieliński

TL;DR

TORE is a novel approach to AVE that divides the encoder into extractor and aggregator components, enabling the reuse of tokens passed to the aggregator and reducing computational overhead by up to 90%.

Abstract

Active Visual Exploration (AVE) optimizes the utilization of robotic resources in real-world scenarios by sequentially selecting the most informative observations. However, modern methods require a high computational budget because they process the same observations multiple times through autoencoder transformers. As a remedy, we introduce a novel approach to AVE called TOken REcycling (TORE). It divides the encoder into extractor and aggregator components. The extractor processes each observation separately, enabling the reuse of tokens passed to the aggregator. Moreover, to further reduce computation, we reduce the decoder to a single block. Through extensive experiments, we demonstrate that TORE outperforms state-of-the-art methods while reducing computational overhead by up to 90%.
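The extractor-aggregator split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TokenRecycler`, the helper `make_block`, the toy linear-plus-tanh blocks, and the reading of `kappa` as the number of extractor blocks are all assumptions made for illustration, and the classification token is omitted.

```python
import numpy as np


class TokenRecycler:
    """Toy extractor-aggregator split with a cache of midway tokens (hypothetical)."""

    def __init__(self, blocks, kappa):
        # blocks: list of callables mapping a token array to a token array.
        # kappa: number of leading blocks used as the extractor (assumed meaning).
        self.extractor = blocks[:kappa]
        self.aggregator = blocks[kappa:]
        self.cache = []            # midway tokens from all glimpses seen so far
        self.extractor_calls = 0   # how many glimpses went through the extractor

    def step(self, glimpse_tokens):
        # Run the extractor on the new glimpse only.
        x = glimpse_tokens
        for block in self.extractor:
            x = block(x)
        self.extractor_calls += 1
        self.cache.append(x)
        # The aggregator sees the midway tokens of every glimpse so far,
        # reusing the cached ones instead of recomputing them.
        y = np.concatenate(self.cache, axis=0)
        for block in self.aggregator:
            y = block(y)
        return y


def make_block(rng, dim):
    # Stand-in for a transformer block: a fixed linear map followed by tanh.
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    return lambda tokens: np.tanh(tokens @ w)


rng = np.random.default_rng(0)
dim = 8
blocks = [make_block(rng, dim) for _ in range(6)]
model = TokenRecycler(blocks, kappa=4)

g1 = rng.standard_normal((4, dim))   # 4 patch tokens for glimpse G^1
g2 = rng.standard_normal((4, dim))   # 4 patch tokens for glimpse G^2
out1 = model.step(g1)                # aggregator input: T^1
out2 = model.step(g2)                # aggregator input: cached T^1 plus new T^2
```

In this sketch each glimpse passes through the extractor exactly once, while the aggregator input grows with the number of cached glimpses. A baseline without the cache would rerun all blocks on every glimpse at every step.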

Paper Structure (29 sections, 7 equations, 7 figures, 5 tables)

Introduction
Related Works
Active Visual Exploration.
Efficient computations for vision models.
Method
Vision Transformers.
AVE sequential prediction.
TOken REcycling (TORE)
Extractor-aggregator framework.
Efficient sequential inference with TORE forward pass.
Attention Map Entropy (AME).
Lightweight decoder.
Training with Random Glimpse Selection Policy.
Random $\kappa$ sampling.
Experimental Setup
...and 14 more sections

Figures (7)

Figure 1: The forward pass of TORE for two sequential glimpses, denoted as $G^1$ and $G^2$. First, the glimpse $G^1$ is processed by the extractor to generate the midway token $T^1$. This token is passed to the aggregator to obtain a prediction, but at the same time, it is cached for future use. When the glimpse $G^2$ appears, it is processed by the extractor to generate the midway token $T^2$, which is passed to the aggregator together with the $T^1$ reused from the cache. This way, the extractor processes each glimpse only once, significantly reducing the computations.
Figure 2: TORE is an efficient approach to Active Visual Exploration based on vision transformers. In each step, it uses the AME selection policy to collect the next glimpse (e.g. $G^4$) using the entropy map generated from the previous glimpses ($G^1,\dots,G^3$). Then, $G^4$ is divided into patches, which are tokenized and pushed through the extractor. Midway tokens are stored in the cache and passed to the aggregator to obtain a prediction. Thanks to the cache of tokens, each glimpse is processed by the extractor only once, which reduces the computational overhead. Please note that the classification token $c$ is defined as the average of all classification tokens generated by the extractor.
Figure 3: Our training policy. For each batch, we sample $\kappa$, which determines the sizes of the extractor and aggregator, from a uniform distribution to ensure the model's flexibility during inference. In this example, the number of extractor blocks equals 2 and 6 for the top and bottom parts, respectively.
Figure 4: Visualization of active visual exploration performed by TORE. On the left side, we present entropy maps used by the selection policy, its successive fields of view, and the corresponding predictions. On the right side, we present the original images.
Figure 5: TORE accuracy (left) and resource utilization (right) with respect to the number of exploration steps on CIFAR100. The accuracy consistently improves with an increasing number of exploration steps and a decreasing value of $\kappa$, ranging from 60% to 80% at the 12th step. Interestingly, minor drops in accuracy (e.g., $\text{TORE}_4$) correspond to a significant 20% reduction in computational cost.
...and 2 more figures
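
The captions of Figures 2 and 3 mention two policies: selecting the next glimpse from an entropy map, and sampling $\kappa$ uniformly for each training batch. The sketch below illustrates both under stated assumptions and is not the paper's implementation. The function names, the sampling range for `kappa`, the summation of row entropies over heads, and selection at the level of single patches rather than whole glimpses are all hypothetical simplifications.

```python
import numpy as np


def sample_kappa(rng, num_blocks):
    # Draw the extractor depth for one training batch from a uniform
    # distribution. The range 1..num_blocks-1 is an assumption of this sketch.
    return int(rng.integers(1, num_blocks))


def patch_entropy(attention):
    # attention: (heads, patches, patches); each row sums to 1.
    # Returns one score per patch: the Shannon entropy of its attention row,
    # summed over heads (assumed aggregation).
    safe = np.clip(attention, 1e-12, 1.0)   # avoids log(0)
    row_entropy = -(attention * np.log(safe)).sum(axis=-1)
    return row_entropy.sum(axis=0)


def select_next(entropy, seen):
    # Pick the not-yet-observed patch with the highest entropy.
    scores = np.where(seen, -np.inf, entropy)
    return int(np.argmax(scores))


# One head, four patches: rows 0 and 2 are fully peaked, row 1 is split
# between two patches, row 3 is uniform.
attention = np.array([[
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]])
entropy = patch_entropy(attention)
seen = np.array([False, False, False, False])
first = select_next(entropy, seen)    # most uncertain patch
seen[first] = True
second = select_next(entropy, seen)   # next most uncertain, unseen patch
kappa = sample_kappa(np.random.default_rng(0), num_blocks=12)
```

The uniform row has the highest entropy, so it is selected first. Once it is marked as seen, the policy moves to the row split between two patches.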
