
The revenge of BiSeNet: Efficient Multi-Task Image Segmentation

Gabriele Rosi, Claudia Cuttano, Niccolò Cavagnero, Giuseppe Averta, Fabio Cermelli

TL;DR

BiSeNetFormer tackles real-time multi-task image segmentation by combining the efficiency of two-stream semantic segmentation architectures with a mask-classification head driven by a transformer decoder. The model keeps a spatial path for high-resolution detail and a context path for semantic richness, and uses a Mask2Former-style decoder to generate $N$ segment embeddings, from which it predicts $N$ binary masks with associated class probabilities. It reaches high frame rates (up to about 100 FPS in some settings) on Cityscapes and ADE20K with competitive accuracy. Key contributions include (1) a two-stream architecture cast entirely as mask classification, so that the same model supports semantic and panoptic segmentation, (2) a compact transformer head with limited masked attention to preserve speed, and (3) extensive hardware demonstrations showing edge-device viability. The results position BiSeNetFormer as a practical option for fast, adaptable segmentation in real-world applications, balancing efficiency and task versatility across standard benchmarks.

Abstract

Recent advancements in image segmentation have focused on enhancing the efficiency of models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask-classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on two popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
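
As a rough illustration of what a mask-classification head computes, the sketch below turns N segment embeddings into N soft binary masks and N class distributions. It assumes the usual MaskFormer-style formulation: mask logits are dot products between segment embeddings and per-pixel features, and class logits come from a linear layer with an extra "no object" class. The abstract does not spell out these details, so the function name `mask_classification_head`, the `class_weights` matrix, and all toy values are hypothetical and not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mask_classification_head(segment_embeddings, pixel_features, class_weights):
    """Assumed MaskFormer-style head (not confirmed by the source).

    segment_embeddings: N vectors of length D (one per query).
    pixel_features:     H x W grid of length-D feature vectors.
    class_weights:      K + 1 vectors of length D; the last row is a
                        hypothetical "no object" class.
    Returns N soft masks (each H x W) and N class distributions.
    """
    masks = [
        [[sigmoid(dot(q, feat)) for feat in row] for row in pixel_features]
        for q in segment_embeddings
    ]
    class_probs = [
        softmax([dot(q, w) for w in class_weights]) for q in segment_embeddings
    ]
    return masks, class_probs


# Toy example: N = 2 queries, D = 2 channels, a 2 x 2 feature map, K = 2 classes.
segment_embeddings = [[1.0, 0.0], [0.0, 1.0]]
pixel_features = [
    [[4.0, -4.0], [-4.0, 4.0]],
    [[4.0, -4.0], [-4.0, 4.0]],
]
class_weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

masks, class_probs = mask_classification_head(
    segment_embeddings, pixel_features, class_weights
)
```

In this toy setup the first query responds to the left column of the feature map and the second to the right column, so each query owns one region and one class. This is the property that lets a single set of outputs serve both semantic and panoptic segmentation.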

Paper Structure

This paper contains 12 sections, 5 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: BiSeNetFormer delivers comparable or superior performance in comparison to existing methods while being the fastest multi-task architecture for image segmentation.
  • Figure 2: Architecture of BiSeNetFormer with its three main components highlighted: spatial path (yellow), context path (violet), and transformer decoder (red). The spatial path extracts high-resolution features from the input image; the context path enlarges the receptive field and produces semantically rich visual features; the transformer decoder takes a set of learnable queries and the high-resolution features as input and produces segment embeddings. A segmentation head then merges the spatial and context path features and computes the final binary masks and class probabilities.
  • Figure 3: Qualitative results for panoptic segmentation on the Cityscapes dataset. Best viewed in color with digital zoom.
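
To make the link between the head's outputs and a dense prediction concrete, here is a minimal sketch of how N binary masks and N class distributions can be collapsed into a semantic label map. It follows the inference rule commonly used by mask-classification models such as MaskFormer: the per-pixel score for class c is the sum over queries of class probability times mask probability, with the "no object" class ignored. Whether BiSeNetFormer uses exactly this rule is an assumption, and `semantic_inference` and the toy numbers are hypothetical.

```python
def semantic_inference(masks, class_probs):
    """Combine per-query masks and class distributions into a label map.

    masks:       N soft masks, each an H x W grid of values in [0, 1].
    class_probs: N distributions over K + 1 classes; the last entry is
                 assumed to be "no object" and is dropped.
    Returns an H x W grid of class indices in [0, K).
    """
    num_classes = len(class_probs[0]) - 1
    height, width = len(masks[0]), len(masks[0][0])
    label_map = []
    for y in range(height):
        row = []
        for x in range(width):
            # score[c] = sum over queries of p_q(c) * m_q[y][x]
            scores = [
                sum(p[c] * m[y][x] for p, m in zip(class_probs, masks))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        label_map.append(row)
    return label_map


# Toy example: two queries over a 2 x 2 image, two real classes plus "no object".
toy_masks = [
    [[0.9, 0.1], [0.8, 0.2]],  # query 0 covers the left column
    [[0.1, 0.9], [0.2, 0.7]],  # query 1 covers the right column
]
toy_class_probs = [
    [0.8, 0.1, 0.1],  # query 0 is mostly class 0
    [0.1, 0.7, 0.2],  # query 1 is mostly class 1
]

label_map = semantic_inference(toy_masks, toy_class_probs)
print(label_map)  # [[0, 1], [0, 1]]
```

Panoptic inference typically starts from the same two outputs but assigns each pixel to a single query rather than a single class, so that separate instances of the same class stay distinct.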