ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

Narges Norouzi, Svetlana Orlova, Daan de Geus, Gijs Dubbelman

TL;DR

ALGM introduces a two-stage, adaptive token merging framework for ViT-based semantic segmentation: local merging in the first layer, followed by global merging in a mid-network layer. Using cosine similarity between tokens and an automatically computed threshold, ALGM reduces the token count without sacrificing segmentation quality, and often improves it, while delivering substantial throughput gains. The method is parameter-free, integrates with plain ViTs and various decoders, and can be tuned for maximum efficiency (ALGM*) or for accuracy. On ADE20K and other datasets, ALGM achieves a better efficiency-quality trade-off than existing token-reduction methods, and it scales to state-of-the-art models such as EVA-based pipelines. These results point to a practical path to faster, accurate segmentation without additional training complexity.

Abstract

This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) in the first network layer, it merges similar tokens within a small local window, and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in those situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
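
The first stage described in the abstract, merging similar tokens within a small local window, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`cosine`, `mean_token`, `local_merge`), the non-overlapping 2×2 window, the fixed threshold `tau=0.9`, and the rule that every pair in a window must be similar before the window is averaged are all assumptions made for illustration. The abstract does not specify these details, and the method itself computes its threshold automatically.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def mean_token(tokens):
    """Element-wise average of a list of token vectors."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def local_merge(grid, window=2, tau=0.9):
    """Merge tokens inside non-overlapping windows of an H x W token grid.

    Assumption (hypothetical rule): a window is replaced by its average
    only if every pair of tokens in it has cosine similarity >= tau;
    otherwise all tokens of that window are kept unchanged.
    Returns a flat list of tokens.
    """
    height, width = len(grid), len(grid[0])
    out = []
    for r in range(0, height, window):
        for c in range(0, width, window):
            block = [grid[i][j]
                     for i in range(r, min(r + window, height))
                     for j in range(c, min(c + window, width))]
            all_similar = all(cosine(a, b) >= tau
                              for a, b in combinations(block, 2))
            if len(block) > 1 and all_similar:
                out.append(mean_token(block))   # one token replaces the window
            else:
                out.extend(block)               # window left untouched
    return out


# 2 x 4 grid: the left 2x2 window is uniform, the right one is mixed.
grid = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
]
merged = local_merge(grid)
print(len(merged))  # 5: one merged token plus four untouched tokens
```

In this toy input, only the uniform window collapses, so 8 tokens become 5. Because the number of merged windows depends on the image content, the amount of reduction varies per input.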

Paper Structure (52 sections, 14 figures, 12 tables)

Figures (14)

  • Figure 1: Efficiency and segmentation quality for ALGM, applied to Segmenter [strudel2021segmenter], SegViT [zhang2022segvit], and SETR [zheng2021setr] on ADE20K. On average, ALGM improves the throughput of these baselines by 39%, while improving the mIoU by +0.7.
  • Figure 2: Comparison of cosine similarity between intra-class and inter-class tokens on the ADE20K training set, using Segmenter + ViT-S [strudel2021segmenter, dosovitskiy2021an]. (a) Local similarities across 5 window sizes in the first layer. (b) Layer-wise analysis of global similarities.
  • Figure 3: ALGM comprises two primary modules: (1) Conditional Local Average Pooling (CLAP) for local merging and (2) Global Bipartite Matching (GBM) for global merging. The top section illustrates the placement of these modules in the first and middle layers, while the bottom provides a detailed visualization of the individual modules.
  • Figure 4: Similarity thresholds for token merging. ALGM applied to Segmenter [strudel2021segmenter] with ViT-S [dosovitskiy2021an] on ADE20K [zhou2017ade20k].
  • Figure 5: Merged tokens. We depict tokens that are merged as a result of the local CLAP and global GBM merging modules.
  • ...and 9 more figures
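
Figure 3 names Global Bipartite Matching (GBM) as the module for global merging. The sketch below shows one generic way to do thresholded bipartite token merging across a whole token sequence. It is an illustration under stated assumptions, not the paper's module: the alternating split into two sets, the fixed threshold `tau=0.9`, the function names, and the output ordering are all hypothetical choices, and the sketch does not track which original positions each merged token covers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def global_bipartite_merge(tokens, tau=0.9):
    """Thresholded bipartite merging over a flat token list.

    Assumptions (hypothetical): tokens are split alternately into sets
    A (even indices) and B (odd indices); each A token is matched to its
    most similar B token and merged into it by averaging if the cosine
    similarity is >= tau. Unmatched A tokens are kept as they are.
    Returns the kept A tokens followed by the (possibly merged) B tokens.
    """
    set_a, set_b = tokens[0::2], tokens[1::2]
    if not set_b:
        return list(tokens)

    groups = [[b] for b in set_b]   # each B token starts its own merge group
    kept = []
    for a in set_a:
        sims = [cosine(a, b) for b in set_b]
        best = max(range(len(set_b)), key=sims.__getitem__)
        if sims[best] >= tau:
            groups[best].append(a)  # merge A token into its best B match
        else:
            kept.append(a)          # no sufficiently similar partner

    merged = [[sum(col) / len(group) for col in zip(*group)]
              for group in groups]
    return kept + merged


tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
reduced = global_bipartite_merge(tokens)
print(len(reduced))  # 3: two merged groups plus one unmatched token
```

Unlike the local window case, candidates here come from anywhere in the sequence, so two distant tokens with similar features can still be combined.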