Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation

Abhishek Aich; Yumin Suh; Samuel Schulter; Manmohan Chandraker

TL;DR

PRO-SCALE (PROgressive Token Length SCALing for Efficient transformer encoders) is a strategy that can be plugged into the Mask2Former segmentation architecture to significantly reduce its computational cost.

Abstract

A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former spends 50% of its compute on the transformer encoder alone. This is because every encoder layer retains a full-length token-level representation of all backbone feature scales. Based on this observation, we propose a strategy termed PROgressive Token Length SCALing for Efficient transformer encoders (PRO-SCALE) that can be plugged into the Mask2Former segmentation architecture to significantly reduce the computational cost. The underlying principle of PRO-SCALE is to progressively scale the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computation by a large margin with minimal sacrifice in performance (~52% encoder and ~27% overall GFLOPs reduction with no drop in performance on the COCO dataset). Experiments conducted on public benchmarks demonstrate PRO-SCALE's flexibility in architectural configurations and show potential for extension beyond segmentation tasks to object detection. Code is available at https://github.com/abhishekaich27/proscale-pytorch
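To make the principle concrete, the following is a minimal sketch (not the authors' implementation) that counts how many tokens each encoder layer processes under a full-length schedule versus a progressive one. It assumes three feature scales at strides 32, 16, and 8, a 1024x1024 input, and that the encoder starts from the coarsest scale and appends finer scales stage by stage; the function names and the `layers_per_stage` argument are hypothetical. Token counts are only a rough proxy for compute and are not meant to reproduce the GFLOPs figures reported above.

```python
from math import ceil


def tokens_per_scale(height, width, strides):
    """Number of tokens contributed by each feature scale (one token per feature-map cell)."""
    return [ceil(height / s) * ceil(width / s) for s in strides]


def full_length_schedule(scale_tokens, num_layers):
    """Every layer processes the tokens of all scales (full-length baseline)."""
    total = sum(scale_tokens)
    return [total] * num_layers


def progressive_schedule(scale_tokens, layers_per_stage):
    """Token length grows stage by stage.

    Assumption: scale_tokens is ordered coarse to fine, and stage k runs
    layers_per_stage[k] layers on the tokens of the first k + 1 scales.
    """
    schedule = []
    active = 0
    for n_tok, n_layers in zip(scale_tokens, layers_per_stage):
        active += n_tok  # append the next (finer) scale to the token sequence
        schedule.extend([active] * n_layers)
    return schedule


if __name__ == "__main__":
    # Assumed strides for the three scales, ordered coarse to fine.
    scales = tokens_per_scale(1024, 1024, strides=(32, 16, 8))
    progressive = progressive_schedule(scales, layers_per_stage=(3, 3, 3))
    full = full_length_schedule(scales, num_layers=len(progressive))
    print("tokens per scale:", scales)
    print("progressive token-layer total:", sum(progressive))
    print("full-length token-layer total:", sum(full))
```

In this toy setting the early layers operate on a small fraction of the full token sequence, which is where the savings come from. Actual savings depend on the attention mechanism, the layer counts of the baseline, and the input resolution.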

Paper Structure (39 sections, 2 equations, 11 figures, 16 tables)


Introduction
Related Works
Methodology
Framework Overview.
PRO-SCALE: Proposed Transformer Encoder
Intuition.
Encoder structure.
Light-Pixel Embedding (LPE) Module
Intuition.
LPE structure.
Experiments
Evaluation metrics.
Baselines.
Architecture Details.
Main Results
...and 24 more sections

Figures (11)

Figure 1: Compute distribution. (a) Mask2Former-style segmentation model. (b) In the Mask2Former model using Res50 he2016deep and SWIN-T liu2021swin backbones, the transformer encoder contributes the most to the overall computation cost, accounting for 54.04% and 50.38%, respectively.
Figure 2: Key idea and performance comparison of PRO-SCALE w.r.t. prior works. (a) illustrates the key idea of PRO-SCALE: progressively extending the token length in the transformer encoder. $\{{\mathbf{s}}_2, {\mathbf{s}}_3, {\mathbf{s}}_4\}$ represent different resolutions. In (b), we show two instantiations of our proposed transformer encoder PRO-SCALE, compared with Mask2Former (M2F) cheng2021mask2former and RT-M2F (an adaptation of lv2023detrs). PRO-SCALE eliminates 80.43% (with configuration ($p_1$, $p_2$, $p_3$) = (1,1,1)) and 51.98% (with configuration ($p_1$, $p_2$, $p_3$) = (3,3,3)) of encoder GFLOPs from M2F while maintaining competitive performance. Results are computed on the COCO lin2014microsoft dataset.
Figure 3: Proposed framework. Our model includes our transformer encoder PRO-SCALE (Sec. \ref{sec:our_enc}), designed to reduce the computational load. $\{{\mathbf{s}}_i\}$ represent the multi-scale backbone features. PRO-SCALE progressively scales the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computation by a large margin with minimal sacrifice in performance. Further, ${\mathbf{s}}_1$ goes through our efficient Light-Pixel Embedding (LPE) module (Sec. \ref{sec:lpe}) to create pixel embeddings for mask prediction. $p_1$, $p_2$ and $p_3$ represent encoder layer frequency. $\{s^\prime, s^{\prime\prime}, s^{\prime\prime\prime}\}$ represent the outputs of the respective layers.
Figure 4: Impact of LPE module. Per-pixel embeddings produced by LPE do not significantly harm performance but substantially reduce computation. Here, backbone = SWIN-T, PRO-SCALE configurations: $c_1 = (1, 1, 1)$, $c_2 = (2, 2, 2)$, $c_3 = (3, 3, 3)$, models = (w/o LPE and w/ LPE), dataset = COCO.
Figure 5: Impact of pre-trained weights. PRO-SCALE provides significant computational boosts, irrespective of backbone pre-trained weights. Here, backbone/dataset = SWIN-T/COCO, weights = supervised / MoBY xie2021self on IN1K russakovsky2015imagenet, PRO-SCALE configurations: $c_1 = (1, 1, 1)$, $c_2 = (3, 1, 1)$, $c_3 = (1, 3, 1)$, $c_4 = (1, 1, 3)$.
...and 6 more figures
