On the Effect of Image Resolution on Semantic Segmentation

Ritambhara Singh, Abhishek Jain, Pietro Perona, Shivani Agarwal, Junfeng Yang

TL;DR

This work challenges the standard practice of downsampling inputs for semantic segmentation by presenting a streamlined architecture that produces outputs directly at native image resolution. It combines a ResNet-like classifier network with a multi-scale segmentation head that uses bottom-up propagation to fuse coarse and fine features, improving detail preservation without excessive computation. The approach is evaluated on multiple datasets, and Cityscapes accuracy is further improved by Noisy Student Training, which highlights the value of semi-supervised learning for high-resolution segmentation. The results suggest that carefully designed multi-scale fusion can match the accuracy of more complex systems that generate lower-resolution outputs, with implications for detailed scene understanding in applications such as autonomous driving.

Abstract

High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model capable of directly producing high-resolution segmentations can match the performance of more complex systems that generate lower-resolution results. By simplifying the network architecture, we enable the processing of images at their native resolution. Our approach leverages a bottom-up information propagation technique across various scales, which we have empirically shown to enhance segmentation accuracy. We have rigorously tested our method using leading-edge semantic segmentation datasets. Specifically, for the Cityscapes dataset, we further boost accuracy by applying the Noisy Student Training technique.
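The loss of fine detail under the downscale-then-upscale strategy can be illustrated with a toy label map. The sketch below is not the paper's experiment: the 8x8 map, the class ids, the factor of 4, and the helper names `downsample`, `upsample`, and `count_class` are all assumptions chosen for illustration, and nearest-neighbour resampling stands in for whatever resizing a real pipeline would use.

```python
def downsample(mask, factor):
    """Keep every `factor`-th pixel along both axes (nearest-neighbour subsampling)."""
    return [row[::factor] for row in mask[::factor]]


def upsample(mask, factor):
    """Repeat each pixel `factor` times along both axes."""
    out = []
    for row in mask:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def count_class(mask, cls):
    """Number of pixels labelled `cls`."""
    return sum(row.count(cls) for row in mask)


# Hypothetical 8x8 label map: background (0), a one-pixel-wide "pole" (1)
# in column 2, and a 4x4 "building" block (2) in the bottom-right corner.
size = 8
native = [[0] * size for _ in range(size)]
for y in range(size):
    native[y][2] = 1
for y in range(4, 8):
    for x in range(4, 8):
        native[y][x] = 2

# Round trip through a 4x lower resolution.
restored = upsample(downsample(native, 4), 4)

print("pole pixels:", count_class(native, 1), "->", count_class(restored, 1))
print("building pixels:", count_class(native, 2), "->", count_class(restored, 2))
```

In this toy case the broad region survives the round trip while the thin structure disappears, which is the behaviour the abstract describes for approaches that predict at reduced resolution.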

Paper Structure (16 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The proposed network. The residual blocks are the same as in He et al. (2016). The gray areas are the bottom-up propagation stages.
  • Figure 2: The merge module used in the proposed network. Upsampling uses bilinear interpolation to avoid checkerboard artifacts (Odena et al., 2016).
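
A minimal sketch of the kind of merge step that Figure 2 describes may help make the caption concrete. The caption only states that bilinear interpolation is used for upsampling. Everything else here is an assumption for illustration: single-channel feature maps stored as nested lists, corner-aligned sampling, fusion by element-wise addition, and the function names `bilinear_upsample` and `merge`. The actual module in the paper may combine features differently.

```python
def bilinear_upsample(fmap, out_h, out_w):
    """Bilinearly resize a 2D map to (out_h, out_w), corner-aligned (an assumption)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row.
        sy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        wy = sy - y0
        row = []
        for j in range(out_w):
            # Fractional source column for this output column.
            sx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            wx = sx - x0
            # Interpolate horizontally on the two neighbouring rows, then vertically.
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def merge(fine, coarse):
    """Upsample `coarse` to the size of `fine` and add them (fusion rule is assumed)."""
    up = bilinear_upsample(coarse, len(fine), len(fine[0]))
    return [[f + u for f, u in zip(f_row, u_row)] for f_row, u_row in zip(fine, up)]


# Hypothetical inputs: a 2x2 coarse map and a 3x3 fine map of ones.
coarse = [[0.0, 2.0], [4.0, 6.0]]
fine = [[1.0] * 3 for _ in range(3)]

merged = merge(fine, coarse)
for row in merged:
    print(row)
```

Because bilinear interpolation weights neighbouring source values smoothly, it avoids the uneven kernel overlap that can give transposed convolutions their checkerboard pattern, which is the motivation the caption cites.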