Efficient Transformer Encoders for Mask2Former-style models

Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

TL;DR

ECO-M2F introduces an adaptive encoder depth mechanism for Mask2Former-style segmentation by training a lightweight gating network to select the optimal number of transformer encoder layers per input image. The approach follows a three-step recipe: Step A trains a dynamic encoder with weighted stochastic depth, Step B derives a per-image exit-quality dataset, and Step C trains the gating network to balance Panoptic Quality with computational cost via a utility function $u(k) = q_k^{(i)} - \beta k$. This framework enables efficient inference with a single decoder shared across exits, reduces training costs when budgets change, and demonstrates competitive accuracy with substantial GFLOPs reductions on COCO and Cityscapes, while remaining extensible to object detection. The work highlights the practicality of adaptive computation in universal segmentation and provides a flexible pathway to deploy efficient transformers on edge devices. Limitations include the need to tune the adaptation factor $\beta$ for different use cases.
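
As a rough illustration of how a gating target could be derived from the utility function, the sketch below picks the exit that maximizes quality minus a depth penalty for one image. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `gating_target`, the 1-based exit index, the tie-breaking toward the shallowest exit, and the per-exit PQ numbers are all hypothetical.

```python
from typing import Sequence


def gating_target(pq_per_exit: Sequence[float], beta: float) -> int:
    """Return the 1-based exit k that maximizes u(k) = q_k - beta * k.

    pq_per_exit[k - 1] is the quality measured for one image at exit k.
    Indexing from 1 and breaking ties toward shallower exits are assumptions.
    """
    if not pq_per_exit:
        raise ValueError("need at least one exit")
    utilities = [q - beta * k for k, q in enumerate(pq_per_exit, start=1)]
    # max() keeps the first maximum, so ties resolve to the shallowest exit.
    best_index = max(range(len(utilities)), key=utilities.__getitem__)
    return best_index + 1


# Hypothetical per-exit PQ values for a single training image.
pq = [0.40, 0.52, 0.58, 0.60, 0.61, 0.61]

print(gating_target(pq, beta=0.0))   # quality only: deepest useful exit
print(gating_target(pq, beta=0.03))  # moderate penalty: a shallower exit
print(gating_target(pq, beta=0.2))   # strong penalty: the first exit
```

Raising the penalty weight moves the target toward shallower exits, which is why only the target derivation and the gating network need to be redone when the compute budget changes.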

Abstract

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is by adapting the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs highly resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability and balance performance against computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computation-accuracy tradeoff, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
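
The inference-time behavior described above, where a gate picks the encoder depth for each input, can be sketched as follows. This is a toy illustration only: `encode_with_gate`, `toy_layer`, and `toy_gate` are hypothetical stand-ins, features are plain lists of floats rather than image feature maps, and the clamping of the gate output is an added safety assumption rather than something the paper states.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]
Layer = Callable[[Features], Features]


def encode_with_gate(
    features: Features,
    layers: Sequence[Layer],
    gate: Callable[[Features], int],
) -> Tuple[Features, int]:
    """Run only the first k encoder layers, where k is chosen by the gate."""
    # Clamp the prediction so the encoder is never skipped entirely and
    # the exit never points past the last layer (an assumption of this sketch).
    k = max(1, min(gate(features), len(layers)))
    for layer in layers[:k]:
        features = layer(features)
    return features, k


def toy_layer(x: Features) -> Features:
    # Stand-in for a transformer encoder layer.
    return [v + 1.0 for v in x]


def toy_gate(x: Features) -> int:
    # Stand-in for the gating network: "easy" inputs exit after 2 layers.
    return 2 if sum(x) / len(x) < 1.0 else 6


layers = [toy_layer] * 6
easy_out, easy_k = encode_with_gate([0.0, 0.5], layers, toy_gate)
hard_out, hard_k = encode_with_gate([2.0, 3.0], layers, toy_gate)
print(easy_k, easy_out)  # 2 [2.0, 2.5]
print(hard_k, hard_out)  # 6 [8.0, 9.0]
```

Because the gate runs once before the encoder, the cost saved is the cost of the skipped layers minus the cost of the gate itself, so the gate has to stay lightweight for the scheme to pay off.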

Paper Structure (28 sections, 9 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison to prior works. Instead of the conventional M2F-style architecture that provides a "one-size-fits-all" solution, our method ECO-M2F trains models to run directly at various encoder depths, matching available resources, by leveraging a gating function. B, E, D, and G denote the backbone, encoder, decoder, and (our proposed) gating network, respectively.
  • Figure 2: (a) Histogram of images achieving best panoptic segmentation by the number of encoder layers. (b) Our method demonstrates superior performance and lower computational cost compared to the baseline models. (Dataset: Cityscapes; Backbone: SWIN-T)
  • Figure 3: ECO-M2F framework. During the model pre-processing phase, we train the model to exit stochastically at $K$ potential exits using Step A. Next, in Step B, we use this model to perform inference on the training images at each exit to create a dataset $\mathcal{D}$. In the model adaptation phase, we perform Step C to establish a gating target based on the computational budget and train a lightweight gating network. During inference, the network exits at the layer designated by the gating network.
  • Figure 4: Intuition for the utility function. This figure shows that prioritizing PQ requires more encoder layers, while fewer layers lead to poorer PQ. (Backbone: SWIN-T; training set)
  • Figure 5: ECO-M2F outperforms M2F with WSD by effectively reducing computation without compromising performance, as shown by the varying results with the Gating module across different $\beta$ values. (Dataset: Cityscapes)
  • ...and 1 more figure