Early Fusion of Features for Semantic Segmentation
Anupam Gupta, Ashok Krishnamurthy, Lisa Singh
TL;DR
This work introduces a memory-efficient semantic segmentation framework that pairs a frozen ResNet-50 classifier with a reverse HRNet decoder to fuse multi-scale features. 1x1 convolutions align the channel dimensions of the classifier's feature maps with those the decoder expects, and an additional high-resolution stream is added while keeping memory usage in check. The ResNet-50 backbone is pretrained in a semi-supervised manner, and the model is evaluated on multiple datasets (Mapillary Vistas, Cityscapes, CamVid, COCO, PASCAL-VOC2012) using pixel accuracy and mIoU, showing competitive segmentation performance with reduced memory demands. The approach underscores the value of preserving high-resolution features for precise segmentation and points to future work on further improving efficiency without sacrificing accuracy.
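As a rough illustration of channel alignment followed by multi-scale fusion, the NumPy sketch below takes four feature maps with the channel counts of ResNet-50's conv2_x to conv5_x stages (256, 512, 1024, 2048 at strides 4, 8, 16, 32). It projects each map to a common width with a 1x1 convolution, then fuses them coarse-to-fine by nearest-neighbour upsampling and summation. This is a minimal sketch under stated assumptions, not the authors' reverse HRNet. The helper names (`conv1x1`, `upsample_nearest`, `align_and_fuse`), the common width of 64, the random weights, and the random stand-in feature maps are all hypothetical. The simple summation also replaces HRNet-style exchange between parallel streams.

```python
import numpy as np

# Output channels of ResNet-50 stages conv2_x..conv5_x (strides 4, 8, 16, 32).
RESNET50_CHANNELS = [256, 512, 1024, 2048]
RESNET50_STRIDES = [4, 8, 16, 32]


def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map: the same linear map applied at every pixel."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def align_and_fuse(features, width, rng):
    """Hypothetical fusion: align all scales to `width` channels, then merge coarse-to-fine.

    The 1x1 weights are random placeholders for learned decoder parameters.
    The backbone features themselves would come from a frozen network.
    """
    aligned = []
    for f in features:
        c_in = f.shape[0]
        w = rng.standard_normal((width, c_in)) / np.sqrt(c_in)
        b = np.zeros(width)
        aligned.append(conv1x1(f, w, b))
    fused = aligned[-1]  # start from the coarsest (stride-32) map
    for finer in reversed(aligned[:-1]):
        factor = finer.shape[1] // fused.shape[1]
        fused = finer + upsample_nearest(fused, factor)
    return fused


# Usage: random stand-ins for backbone outputs of a 128x128 image.
rng = np.random.default_rng(0)
size = 128
feats = [rng.standard_normal((c, size // s, size // s))
         for c, s in zip(RESNET50_CHANNELS, RESNET50_STRIDES)]
out = align_and_fuse(feats, width=64, rng=rng)
print(out.shape)  # (64, 32, 32)
```

With a 128x128 input, the four maps range from 32x32 down to 4x4. The fused result is a 64-channel map at stride 4, which a final 1x1 convolution could project to per-class logits. Keeping every scale at one shared width after the 1x1 projection is what makes the element-wise sums well defined.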
Abstract
This paper introduces a segmentation framework that couples a classifier network with a reverse HRNet architecture for efficient image segmentation. A ResNet-50 backbone, pretrained in a semi-supervised manner, generates feature maps at several scales. A reverse HRNet, adapted through 1x1 convolutions to accept the backbone's varying channel dimensions, processes these maps to produce the final segmentation output. We deliberately keep the backbone frozen rather than fine-tuning it, which reduces memory consumption during training. We evaluate the method on several benchmark datasets, including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012, measuring segmentation performance with pixel accuracy and mean Intersection over Union (mIoU). The results show that the proposed model achieves high segmentation accuracy, indicating its potential for a range of image-analysis applications. By combining ResNet-50 and the reverse HRNet in a unified framework, we present a robust, memory-efficient approach to image segmentation.
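For reference, the two evaluation metrics can both be computed from a class confusion matrix. Pixel accuracy is the fraction of labeled pixels whose predicted class matches the ground truth. For each class c, the IoU is TP_c / (TP_c + FP_c + FN_c), and mIoU is the mean of these per-class scores. The pure-Python sketch below is illustrative only. It assumes an ignore label of 255 for void pixels and skips classes that appear in neither prediction nor target. These are common benchmark conventions, but both are assumptions here and not details stated in the paper. The function names are hypothetical.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """cm[t][p] counts pixels with ground truth t predicted as p (ignored pixels skipped)."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:  # assumed void label, excluded from both metrics
            continue
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm: List[List[int]]) -> float:
    """Correctly labeled pixels divided by all counted pixels."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total if total else 0.0


def mean_iou(cm: List[List[int]]) -> float:
    """Average of TP / (TP + FP + FN) over classes with a non-empty union."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                      # class c pixels predicted as something else
        fp = sum(cm[r][c] for r in range(n)) - tp  # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:  # assumption: classes absent from both maps are skipped
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Usage on flattened label maps with 3 classes; the last target pixel is void.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 255], num_classes=3)
print(pixel_accuracy(cm))  # 0.8
print(mean_iou(cm))        # (1/2 + 2/3 + 1) / 3 = 13/18
```

Accumulating a single confusion matrix over the whole evaluation set, rather than averaging per-image scores, is the usual way these dataset-level metrics are reported.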
