The revenge of BiSeNet: Efficient Multi-Task Image Segmentation
Gabriele Rosi, Claudia Cuttano, Niccolò Cavagnero, Giuseppe Averta, Fabio Cermelli
TL;DR
BiSeNetFormer targets real-time multi-task image segmentation by combining the efficiency of two-stream semantic segmentation architectures with a mask-classification head built on a transformer decoder. The model keeps a spatial path for high-resolution detail and a context path for semantic information, and uses a Mask2Former-style decoder to produce $N$ segment embeddings, each yielding a binary mask and class probabilities. It reaches high inference speeds (up to about 100 FPS in some settings) on Cityscapes and ADE20K with competitive accuracy. Key contributions include (1) a two-stream architecture cast entirely as mask classification, supporting both semantic and panoptic segmentation, (2) a compact transformer head with limited masked attention to maintain speed, and (3) extensive hardware demonstrations showing edge-device viability. The results position BiSeNetFormer as a practical option for fast, adaptable segmentation in real-world applications, balancing efficiency and task versatility across standard benchmarks.
Abstract
Recent advancements in image segmentation have focused on enhancing the efficiency of models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
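
To make the mask classification output concrete, the following is a minimal NumPy sketch of how a set of binary masks and per-segment class probabilities could be merged into a semantic label map. It assumes the common MaskFormer-style rule (weight each mask by its class probabilities, sum over segments, take the per-pixel argmax); the abstract does not state the exact inference rule used by BiSeNetFormer. The function names, tensor shapes, and the trailing "no object" class are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def semantic_inference(class_logits, mask_logits):
    """Merge per-segment predictions into a semantic label map.

    class_logits: (N, K + 1) array; the last column is an assumed
                  "no object" class that is dropped before merging.
    mask_logits:  (N, H, W) array of per-segment binary mask logits.
    Returns an (H, W) integer array of class indices in [0, K).
    """
    class_probs = softmax(class_logits, axis=-1)[:, :-1]  # (N, K)
    mask_probs = sigmoid(mask_logits)                      # (N, H, W)
    # Per-class score at each pixel: sum over segments of p(class) * p(mask).
    scores = np.einsum("nk,nhw->khw", class_probs, mask_probs)
    return scores.argmax(axis=0)


# Toy input (hypothetical values): N = 3 segments, K = 2 classes, 2x4 image.
N, H, W = 3, 2, 4
class_logits = np.array([
    [8.0, 0.0, 0.0],  # segment 0 -> class 0
    [0.0, 8.0, 0.0],  # segment 1 -> class 1
    [0.0, 0.0, 8.0],  # segment 2 -> "no object"
])
mask_logits = np.full((N, H, W), -8.0)
mask_logits[0, :, :2] = 8.0  # segment 0 covers the left half
mask_logits[1, :, 2:] = 8.0  # segment 1 covers the right half
mask_logits[2] = 8.0         # segment 2 covers everything but is "no object"

labels = semantic_inference(class_logits, mask_logits)
print(labels)
```

In this toy case the "no object" segment covers the whole image yet contributes almost nothing to any class score, so the left half is assigned class 0 and the right half class 1. Panoptic inference would additionally need to keep segments separate per instance, which this sketch does not cover.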
