ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition

Seungdong Yoa, Seungjun Lee, Hyeseung Cho, Bumsoo Kim, Woohyung Lim

TL;DR

Vision Transformers suffer from high computational cost due to dense self-attention over a large token set. ImagePiece introduces a content-aware retokenization that merges non-semantic tokens into meaningful chunks, guided by a local coherence bias and a MaxMatch-inspired process, then reevaluates token importance to prune truly non-semantic tokens. The method consistently improves accuracy while boosting throughput, outperforming both pruning- and merging-based baselines on ImageNet-1k, and remains robust under hyper-speed and masking scenarios. This approach offers a practical path to efficient ViTs with preserved semantic fidelity, expanding the applicability of large-scale transformers in resource-constrained settings.

Abstract

Vision Transformers (ViTs) have achieved remarkable success in various computer vision tasks. However, ViTs incur a huge computational cost due to their inherent reliance on multi-head self-attention (MHSA), prompting efforts to accelerate them for practical applications. To this end, recent works aim to reduce the number of tokens, mainly focusing on how to effectively prune or merge them. Nevertheless, since ViT tokens are generated from non-overlapping grid patches, they usually do not convey sufficient semantics, making them incompatible with efficient ViTs. To address this, we propose ImagePiece, a novel re-tokenization strategy for Vision Transformers. Following the MaxMatch strategy of NLP tokenization, ImagePiece groups semantically insufficient yet locally coherent tokens until they convey meaning. This simple re-tokenization is highly compatible with previous token reduction methods and can drastically narrow down the relevant tokens, increasing the inference speed of DeiT-S by 54% (nearly 1.5$\times$ faster) while achieving a 0.39% improvement in ImageNet classification accuracy. For hyper-speed inference scenarios (with 251% acceleration), our approach surpasses other baselines by over 8% in accuracy.
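
For readers unfamiliar with the MaxMatch strategy the abstract refers to, the following is a minimal sketch of greedy longest-match-first segmentation as used in WordPiece-style text tokenizers. It illustrates only the NLP analogy, not ImagePiece's image-side grouping; the function name `max_match`, the `##` continuation marker convention, and the toy vocabulary are assumptions chosen for illustration.

```python
def max_match(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Illustrative sketch only: `vocab` is a hypothetical toy vocabulary,
    and non-initial pieces carry a '##' continuation marker.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is known: give up on the whole word.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary for demonstration.
toy_vocab = {"un", "##afford", "##able", "##a", "token", "##ize", "##r"}

print(max_match("unaffordable", toy_vocab))
print(max_match("tokenizer", toy_vocab))
```

The point of the analogy is that pieces which carry little meaning on their own are joined with their neighbours until the result is a recognizable unit; ImagePiece applies the same idea to patch tokens that are locally coherent.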

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, 9 tables.

Figures (3)

  • Figure 1: An illustration of the ImagePiece pipeline compared to WordPiece in NLP. While text is tokenized into meaningful tokens, image tokens from patches often contain irrelevant or non-semantic information. The retokenization in ImagePiece enables these non-semantic tokens to be merged into meaningful tokens, particularly when they have the potential to become meaningful after retokenization.
  • Figure 2: Overall architecture of the proposed method.
  • Figure 3: Comparison of our ImagePiece with the patch tokenizer of ViT in terms of inference speed and accuracy. While the baselines show a marked decline in performance as throughput increases, our method not only maintains relatively robust accuracy but also demonstrates compatibility, even at high speeds.