Efficient Multi-Task Scene Analysis with RGB-D Transformers

Söhnke Benedikt Fischedick, Daniel Seichter, Robin Schmidt, Leonard Rabes, Horst-Michael Gross

TL;DR

This work tackles the challenge of real-time, multi-task scene understanding for mobile robots by introducing EMSAFormer, a single RGB-D Swin Transformer encoder that jointly performs panoptic segmentation, instance orientation estimation, and scene classification. It replaces the previous dual CNN encoder with a unified Transformer backbone and adds a specialized TensorRT extension to achieve real-time inference on embedded hardware. Through extensive experiments on NYUv2, SUNRGB-D, and ScanNet, EMSAFormer achieves state-of-the-art or competitive results across tasks while delivering up to 39.1 FPS on a Jetson AGX Orin. The approach demonstrates that a carefully designed RGB-D Transformer, combined with task-specific decoders and optimized deployment, enables robust, on-device multi-task scene analysis for indoor environments.

Abstract

Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated in a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimized inference for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
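To put the reported throughput in perspective, a frame rate can be converted into an average per-frame time budget. The following is a minimal sketch of that arithmetic only. The function names and the 30 Hz camera rate are assumptions made for illustration and are not taken from the paper.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the average time available per frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def meets_rate(fps: float, sensor_hz: float) -> bool:
    """Return True if a model running at `fps` keeps up with a sensor at `sensor_hz`."""
    return fps >= sensor_hz


reported_fps = 39.1  # peak throughput quoted in the abstract
camera_hz = 30.0     # assumed camera rate (hypothetical, not from the paper)

print(f"{frame_budget_ms(reported_fps):.1f} ms per frame")
print("keeps up with camera:", meets_rate(reported_fps, camera_hz))
```

At 39.1 FPS the average budget is roughly 25.6 ms per frame, which would leave some headroom over an assumed 30 Hz RGB-D camera stream.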

Paper Structure (17 sections, 5 figures, 4 tables)

This paper contains 17 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Application (bottom) of our proposed efficient multi-task scene analysis approach with an RGB-D Transformer encoder, called EMSAFormer, that simultaneously performs panoptic segmentation, instance orientation estimation, and scene classification (top). See Fig. 2 for prediction colors.
  • Figure 2: Architecture of our proposed efficient multi-task scene analysis approach with a single RGB-D Transformer encoder (EMSAFormer) that simultaneously performs panoptic segmentation, instance orientation estimation, and scene classification. For further details and explanations, see the paper's main section (label sec:main). Semantic colors are chosen as in EMSANet [emsanet2022ijcnn] and are defined for NYUv2 [NYUv2-eccv2012] at https://github.com/TUI-NICR/nicr-scene-analysis-datasets/blob/v0.5.3/nicr_scene_analysis_datasets/datasets/nyuv2/nyuv2.py#L193. Panoptic segmentation is visualized by small color differences.
  • Figure 3: Original SwinV2-T architecture [swinv2-cvpr2022] (top) and our modifications (bottom) to efficiently incorporate depth information in a single encoder backbone.
  • Figure 4: Results on the NYUv2 test split when performing semantic segmentation (top) and instance segmentation (bottom) in a single-task setting with various encoder configurations, plotted over inference throughput (NVIDIA Jetson AGX Orin 32 GB, JetPack 5.1.1, TensorRT 8.5.2, Float16, 50 W). See the paper's metrics section (label sec:experiments:metrics) for metrics.
  • Figure 5: Qualitative results shown as RGB images overlaid with the predicted panoptic segmentation, the predicted scene class, and the estimated orientations, where available.