ContextFormer: Redefining Efficiency in Semantic Segmentation
Mian Muhammad Naeem Abid, Nancy Mehta, Zongwei Wu, Radu Timofte
TL;DR
ContextFormer targets the inefficiency of the bottleneck stage in semantic segmentation with a hybrid CNN–ViT framework that couples local and global context at that stage. It combines a Token Pyramid Extraction Module (TPEM) for multi-scale token generation, a hybrid Trans-BDC bottleneck that fuses Branched Depthwise Convolutions with lightweight self-attention, and a Feature Merging Module (FMM) with a gated fusion mechanism to produce coherent segmentation maps. Across ADE20K, Pascal Context, CityScapes, and COCO-Stuff, ContextFormer achieves competitive or superior mIoU while substantially reducing GFLOPs and parameter counts compared to state-of-the-art efficient models, showing a strong efficiency–accuracy trade-off and robustness to real-world conditions. The model is pretrained on ImageNet and its benefits transfer to downstream tasks such as object detection, which makes it practical for real-time, resource-constrained semantic segmentation.
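To make the idea of gated fusion concrete, the following is a minimal sketch of one generic form of it: a sigmoid gate blends a local (convolutional) feature with a global (attention) feature element by element. It is an assumption for illustration only. This summary does not specify how the FMM computes its gate, and the names `gated_fusion`, `local_feat`, `global_feat`, and `gate_logits` are hypothetical. In a trained network the gate logits would typically come from a learned layer rather than being passed in by hand.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, gate_logits):
    """Blend two equal-length feature vectors element by element.

    Generic gated fusion (an illustrative assumption, not the paper's FMM):
        fused = g * local + (1 - g) * global,  with g = sigmoid(gate_logit)
    """
    if not (len(local_feat) == len(global_feat) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for loc, glob, logit in zip(local_feat, global_feat, gate_logits):
        g = sigmoid(logit)  # g near 1 favors the local feature
        fused.append(g * loc + (1.0 - g) * glob)
    return fused


# A zero logit gives g = 0.5, i.e. a plain average of the two branches.
balanced = gated_fusion([1.0, 2.0], [3.0, 6.0], [0.0, 0.0])
```

With a large positive logit the output approaches the local feature, and with a large negative logit it approaches the global one, so the gate lets each position choose how much of each branch to keep.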
Abstract
Semantic segmentation assigns labels to pixels in images, a critical yet challenging task in computer vision. Convolutional methods capture local dependencies well but struggle with long-range relationships. Vision Transformers (ViTs) excel at capturing global context but are hindered by high computational demands, especially for high-resolution inputs. Most research optimizes the encoder architecture and leaves the bottleneck underexplored, even though it is a key area for enhancing performance and efficiency. We propose ContextFormer, a hybrid framework that leverages the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation. The framework's efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Branched DepthwiseConv (Trans-BDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on the ADE20K, Pascal Context, CityScapes, and COCO-Stuff datasets show that ContextFormer significantly outperforms existing models, achieving state-of-the-art mIoU scores and setting a new benchmark for efficiency and performance. The code will be made publicly available upon acceptance.
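For readers unfamiliar with the metric reported above, mean intersection-over-union (mIoU) averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The sketch below computes it for two flat label sequences in plain Python. The function name `mean_iou` and the `ignore_index` default of 255 are illustrative assumptions, not taken from the paper. Benchmark toolkits commonly accumulate the intersection and union counts over an entire dataset before dividing, which this sketch also supports if all pixels are passed in at once.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union between two flat label sequences.

    Pixels whose ground-truth label equals ignore_index are skipped.
    Classes that appear in neither pred nor target are left out of the mean.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
# Per-class IoU here is 1/3, 2/3, and 1/2, so the mean is 0.5.
```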
