Accelerating Transformers with Spectrum-Preserving Token Merging

Hoai-Chau Tran; Duy M. H. Nguyen; Duy M. Nguyen; Trung-Tin Nguyen; Ngan Le; Pengtao Xie; Daniel Sonntag; James Y. Zou; Binh T. Nguyen; Mathias Niepert

TL;DR

PiToMe is a token-merging paradigm that prioritizes preserving informative tokens using an additional metric, the energy score. Large clusters of similar tokens are scored as high-energy and become candidates for merging, while smaller (unique and isolated) clusters are scored as low-energy and preserved.

Abstract

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVA), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top k similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (0.5% average performance drop for ViT-MAE-H compared to 2.6% for baselines), image-text retrieval (0.3% average performance drop for CLIP on Flickr30k compared to 4.5% for other methods), and, analogously, visual question answering with LLaVA-7B. Furthermore, PiToMe is theoretically shown to preserve intrinsic spectral properties of the original token space under mild conditions.
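
The intuition behind the energy score can be illustrated with a small sketch. The abstract does not give the score's formula, so the code below uses a stand-in that is an assumption for illustration only: each token's energy is its mean cosine similarity to all other tokens. The function name `energy_scores` and the toy token layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def energy_scores(tokens: np.ndarray) -> np.ndarray:
    """Illustrative proxy (an assumption, not the paper's formula):
    mean cosine similarity of each token to every other token."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T          # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)       # ignore self-similarity
    return sim.sum(axis=1) / (tokens.shape[0] - 1)

# Toy token set: four near-duplicate tokens (a "large cluster")
# plus two tokens that are orthogonal to everything else ("isolated").
dim = 8
basis = np.eye(dim)
cluster = np.stack([basis[0] + 0.1 * basis[3 + i] for i in range(4)])
isolated = np.stack([basis[1], basis[2]])
tokens = np.vstack([cluster, isolated])

energy = energy_scores(tokens)

# High-energy tokens are merge candidates; low-energy tokens are kept.
merge_candidates = np.argsort(-energy)[:2]
protected = np.argsort(energy)[:2]
print("energy:", np.round(energy, 3))
print("merge candidates:", merge_candidates, "protected:", protected)
```

In this toy setup the near-duplicate tokens receive higher energy than the isolated ones, so a merging step that picks the highest-energy tokens would leave the isolated tokens untouched.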

Paper Structure (31 sections, 8 theorems, 46 equations, 15 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Methodology
Token Merging Formulation
Energy-based Merging
Connection to Graph Coarsening with Spectral Preservation
Experiments
Image & Text Retrieval
Visual Question Answering (VQA) with Large Vision-Language Models
Image Classification on Imagenet-1k
Text Classification
PiToMe Ablation Studies
Conclusion
Datasets Descriptions
PiToMe Algorithm
...and 16 more sections

Key Result

Lemma 1

The normalized Laplacian eigenvalues of the lifted graph $\boldsymbol{\lambda}_l$ contain all the eigenvalues of the coarse graph $\boldsymbol{\lambda}_c$, together with the additional eigenvalue $1$ with multiplicity $N-n$.
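
The lemma can be checked numerically on a small graph. The sketch below assumes the definitions of coarsening and lifting that are standard in the graph-coarsening literature, since Definitions 1 and 2 are not reproduced here: the coarse weight between two clusters is the sum of the original weights between their members (including intra-cluster weights as self-loops), and the lifted weight between two nodes is the coarse weight between their clusters divided by the product of the cluster sizes. The partition, graph size, and variable names are arbitrary choices for illustration.

```python
import numpy as np

def normalized_laplacian_eigs(W: np.ndarray) -> np.ndarray:
    """Sorted eigenvalues of I - D^{-1/2} W D^{-1/2} for a symmetric W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return np.sort(np.linalg.eigvalsh(L))

# Random dense weighted graph on N nodes (symmetric, no self-loops).
rng = np.random.default_rng(0)
N = 6
A = rng.uniform(0.1, 1.0, size=(N, N))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

# Arbitrary partition of the N nodes into n clusters.
assignment = np.array([0, 0, 0, 1, 1, 2])
n = assignment.max() + 1
C = np.eye(n)[assignment]        # N x n membership matrix
sizes = C.sum(axis=0)            # cluster sizes |S_i|

# Assumed coarsening: W_c[i, j] = sum of W[u, v] for u in S_i, v in S_j.
W_c = C.T @ W @ C
# Assumed lifting: W_l[u, v] = W_c[i, j] / (|S_i| * |S_j|).
W_l = (C / sizes) @ W_c @ (C / sizes).T

eigs_c = normalized_laplacian_eigs(W_c)
eigs_l = normalized_laplacian_eigs(W_l)
expected = np.sort(np.concatenate([eigs_c, np.ones(N - n)]))

print("coarse:", np.round(eigs_c, 4))
print("lifted:", np.round(eigs_l, 4))
print("match:", np.allclose(eigs_l, expected))
```

Under these assumed definitions, the lifted graph's normalized adjacency matrix factors as $Q A_c Q^\top$ with $Q^\top Q = I_n$, which is why its spectrum is that of the coarse graph padded with $N-n$ zeros, and the Laplacian spectrum is padded with $N-n$ ones.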

Figures (15)

Figure 1: A comparison of token merging methods. Patches of the same color are merged. Green arrows highlight incorrect merges, which PiToMe avoids. The positions of tokens with high attention scores (cyan borders, zoom in for clarity) in PiToMe are maintained in proportions akin to those of ViT-base 384.
Figure 2: a) PiToMe can be inserted inside a transformer block; b) Energy scores are computed to identify mergeable and protective tokens; c) Our algorithm gradually merges tokens in each block.
Figure 3: Off-the-shelf image-text retrieval comparison between PiToMe and merging/pruning methods on different backbones and tasks when varying the number of merged tokens. Here, Recall sum = $Rt@1 + Rt@5 + Rt@10 + Ri@1 + Ri@5 + Ri@10$; a value close to 600 indicates that the recall scores at top 1, 5, and 10 for retrieving images and text are close to 100%. PiToMe curves are, in most cases, above those of the other baselines.
Figure 4: Off-the-shelf performance of PiToMe on LLaVA-1.5-7B with different compression ratios $r$.
Figure 5: Off-the-shelf results on Imagenet-1k. Zoom in for better view.
...and 10 more figures

Theorems & Definitions (11)

Definition 1: Graph Coarsening
Definition 2: Graph Lifting
Lemma 1: Eigenvalue Preservation (see, e.g., jin_graph_2020, loukas_graph_2019, toivonen_compression_2011, butler_interlacing_2007)
Theorem 1: Spectrum Consistency of Token Merging
Proposition 1
Proposition 2
Proposition 3
Lemma 2
Lemma 3
Proof: Proof of the 2-node triangle inequality \ref{eq_Gc_norm1_2Nodes}
...and 1 more
