Early Fusion of Features for Semantic Segmentation

Anupam Gupta, Ashok Krishnamurthy, Lisa Singh

TL;DR

This work introduces a memory-efficient semantic segmentation framework that pairs a frozen ResNet-50 classifier with a reverse HRNet decoder to fuse multi-scale features. 1x1 convolutions align the channel dimensions between the classifier's feature maps and the decoder, and an additional high-resolution stream is added while keeping memory usage in check. The ResNet-50 backbone is pretrained in a semi-supervised manner and is not fine-tuned; the model is evaluated on Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012 using pixel accuracy and mIoU, showing competitive segmentation performance with reduced memory demands. The approach underscores the value of preserving high-resolution features for precise segmentation and points to future work on further efficiency improvements without sacrificing accuracy.

Abstract

This paper introduces a novel segmentation framework that integrates a classifier network with a reverse HRNet architecture for efficient image segmentation. Our approach utilizes a ResNet-50 backbone, pretrained in a semi-supervised manner, to generate feature maps at various scales. These maps are then processed by a reverse HRNet, which is adapted to handle varying channel dimensions through 1x1 convolutions, to produce the final segmentation output. We strategically avoid fine-tuning the backbone network to minimize memory consumption during training. Our methodology is rigorously tested across several benchmark datasets including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012, employing metrics such as pixel accuracy and mean Intersection over Union (mIoU) to evaluate segmentation performance. The results demonstrate the effectiveness of our proposed model in achieving high segmentation accuracy, indicating its potential for various applications in image analysis. By leveraging the strengths of both the ResNet-50 and reverse HRNet within a unified framework, we present a robust solution to the challenges of image segmentation.
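The two evaluation metrics named in the abstract, pixel accuracy and mIoU, can both be derived from a class confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names, the flattened label lists, and the `ignore_index=255` convention for unlabeled pixels are assumptions made for illustration.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count pixels per (true class, predicted class) pair.

    pred and target are flat sequences of integer class ids.
    Pixels whose target equals ignore_index are skipped (an assumed convention).
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of counted pixels whose predicted class matches the label."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither predictions nor labels are left out
    of the average instead of being counted as zero.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # labeled c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, labeled otherwise
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

Because mIoU averages over classes while pixel accuracy averages over pixels, mIoU is the less forgiving of the two when small classes are segmented poorly.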

Paper Structure (11 sections, 2 figures)


Figures (2)

  • Figure 1: The proposed network. The gray blocks are the pretrained backbone and are kept frozen during training.
  • Figure 2: Visual comparison of semantic segmentation results between our proposed method and HRNet. Our method produces more refined multi-scale feature maps that capture local edges and textures more precisely, which contributes to improved segmentation in complex scenes.
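The abstract states that 1x1 convolutions adapt the backbone's feature maps to the channel dimensions the reverse HRNet expects. A 1x1 convolution is a per-pixel linear map over channels: it changes the channel count and leaves the spatial size untouched. The sketch below illustrates that operation in plain Python on nested lists. It is an illustration only, not the paper's implementation; the function name `conv1x1`, the toy shapes, and the weights are assumptions.

```python
def conv1x1(feature, weight, bias=None):
    """Apply a 1x1 convolution to one feature map.

    feature: nested list shaped [C_in][H][W]
    weight:  nested list shaped [C_out][C_in]
    bias:    optional list of length C_out
    Returns a nested list shaped [C_out][H][W].
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    out = []
    for o, row in enumerate(weight):
        if len(row) != c_in:
            raise ValueError("each weight row needs one entry per input channel")
        b = bias[o] if bias is not None else 0.0
        # Each output pixel mixes the input channels at the same (y, x) location.
        out.append([
            [b + sum(row[c] * feature[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ])
    return out
```

In a deep learning framework this would be a convolution layer with kernel size 1, with one such layer per backbone scale; only these layers and the decoder would receive gradients while the frozen backbone stays fixed.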