Semantic Scene Segmentation for Robotics

Juana Valeria Hurtado; Abhinav Valada

Semantic Scene Segmentation for Robotics

Juana Valeria Hurtado, Abhinav Valada

TL;DR

Semantic Scene Segmentation for Robotics surveys the task of dense pixel-wise scene labeling and its central role in robot perception. It systematically reviews traditional and deep learning architectures, with emphasis on encoder–decoder designs, multi-scale context modules such as ASPP/PSPNet, and real-time variants, while also covering multi-input settings including video, point clouds, and multimodal fusion. The chapter catalogs outdoor, indoor, and general-purpose datasets and standard metrics (e.g., $PA$, $IoU$, $mIoU$, PRC-based measures) and discusses computational considerations like runtime and FLOPs. It highlights challenges such as data labeling bottlenecks and the need for real-time, robust models, pointing toward future directions in weakly/self-supervised learning, transfer learning, and holistic scene representations like panoptic segmentation for robotics applications.

Abstract

Comprehensive scene understanding is a critical enabler of robot autonomy. Semantic segmentation is one of the key scene understanding tasks which is pivotal for several robotics applications including autonomous driving, domestic service robotics, last mile delivery, amongst many others. Semantic segmentation is a dense prediction task that aims to provide a scene representation in which each pixel of an image is assigned a semantic class label. Therefore, semantic segmentation considers the full scene context, incorporating the object category, location, and shape of all the scene elements, including the background. Numerous algorithms have been proposed for semantic segmentation over the years. However, the recent advances in deep learning combined with the boost in the computational capacity and the availability of large-scale labeled datasets have led to significant advances in semantic segmentation. In this chapter, we introduce the task of semantic segmentation and present the deep learning techniques that have been proposed to address this task over the years. We first define the task of semantic segmentation and contrast it with other closely related scene understanding problems. We detail different algorithms and architectures for semantic segmentation and the commonly employed loss functions. Furthermore, we present an overview of datasets, benchmarks, and metrics that are used in semantic segmentation. We conclude the chapter with a discussion of challenges and opportunities for further research in this area.

Semantic Scene Segmentation for Robotics

TL;DR

, PRC-based measures) and discusses computational considerations like runtime and FLOPs. It highlights challenges such as data labeling bottlenecks and the need for real-time, robust models, pointing toward future directions in weakly/self-supervised learning, transfer learning, and holistic scene representations like panoptic segmentation for robotics applications.

Abstract

Paper Structure (49 sections, 9 equations, 13 figures)

This paper contains 49 sections, 9 equations, 13 figures.

Semantic Scene Segmentation for Robotics
Introduction
Algorithms and Architectures for Semantic Segmentation
Traditional Methods
Deep Learning Methods
Encoder Variants
Upsampling Methods
Techniques for Exploiting Context
Encoder-Decoder Architecture
Image Pyramid
Conditional Random Fields
Spatial Pyramid Pooling
Dilated Convolution
Real-Time Architectures
Object Detection-based Methods
...and 34 more sections

Figures (13)

Figure 1: Each row shows an example with an input image and the corresponding output of different scene understanding tasks. Object classification identifies "what" objects compose the image, object detection predicts "where" the objects are located in the image, object segmentation outputs a mask that indicates the shape of the object. Semantic Segmentation further details the input image by predicting the label of all the pixels, including the background.
Figure 2: Illustration of the semantic segmentation output in which a semantic class label is assigned to each pixel in the image. The network predicts label indices for each pixel which is depicted with different colors for visualization purposes. The predictions overlaid on the input image is shown on the left and the label indices overlaid on the input image is shown on the right right.
Figure 3: Comparison of semantic segmentation, instance segmentation, and panoptic segmentation tasks. Semantic segmentation assigns a class label to each pixel in the image and instance segmentation assigns an instance ID to pixels belonging to individual objects as well as semantic class label to each pixel in the image. The panoptic segmentation task unifies semantic and instance segmentation.
Figure 4: An example topology of a Convolutional Neural Network (CNN) used for image classification (top) and a Fully Convolutional Network (FCN) that is used for dense prediction (bottom). Note that FCNs do not contain any fully connected layers.
Figure 5: An example topology of an encoder-decoder architecture. Typically, the encoder is a pre-trained classification CNN that uses downsampling layers to capture the contextual information. On the other hand, the decoder network is composed of up-sampling layers to recover the spatial information, yielding the pixel-level classification output with the same resolution as the input image.
...and 8 more figures

Semantic Scene Segmentation for Robotics

TL;DR

Abstract

Semantic Scene Segmentation for Robotics

Authors

TL;DR

Abstract

Table of Contents

Figures (13)