
LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

Calvin Galagain, Martyna Poreba, François Goulette, Cyrill Stachniss

Abstract

Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose LiPS, a novel approach to computationally efficient panoptic segmentation: a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims to provide strong panoptic segmentation performance while substantially lowering computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines while providing up to 4.5× higher throughput, measured in frames per second, and requiring nearly 6.8× fewer computations. This efficiency makes LiPS a practical bridge between modern panoptic models and real-world robotic applications.

Paper Structure

This paper contains 14 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: LiPS architecture. Encoder produces a four-level feature hierarchy. A routing step selects a subset of levels (1-4), which are downsampled with strided convolutions, enriched with sine positional encodings, and fused through a shallow deformable-attention pixel decoder. A lightweight top-down FPN exposes three mask-feature scales for the masked transformer decoder, which predicts class labels and panoptic masks from a fixed set of queries. By default, two routed levels (1 and 2) are used for embedded deployment. Dashed arrows denote skipped levels; solid arrows indicate active information flow. (A minimal code sketch of this routed pathway follows the figure list.)
  • Figure 2: GFLOPs vs. input size (log scale) for Mask2Former-R50 and LiPS variants with 1-4 routed levels. Blue arrows annotate the compute reduction factor, i.e., Mask2Former / LiPS (full), at each resolution.
  • Figure 3: Qualitative comparison on Cityscapes. Columns (left→right): (a) input, (b) ground truth, (c) LiPS (2 levels), (d) LiPS (full), (e) Mask2Former-R50. A fixed ROI is cropped and magnified identically across methods to enable like-for-like inspection of boundary adherence (e.g., curb/sidewalk), thin-structure fidelity (poles, traffic signs), and stuff/thing separation.
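
To make the routed pathway described in Figure 1 concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the channel widths, the token dimension, and the single dense nn.MultiheadAttention layer standing in for the shallow deformable-attention pixel decoder are all illustrative assumptions. Only the level routing, strided-convolution downsampling, sine positional encodings, and one fusion layer are shown; the top-down FPN and the masked transformer decoder are omitted.

```python
# Illustrative sketch of a LiPS-style routed feature pathway (assumptions:
# channel widths, dim=128, and dense attention as a deformable-attention
# stand-in; this is not the authors' code).
import math
import torch
import torch.nn as nn


def sine_positional_encoding(h, w, dim):
    """DETR-style 2-D sine/cosine positional encoding, shape (h*w, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    freqs = torch.exp(torch.arange(dim // 4) * (-math.log(10000.0) / (dim // 4)))
    y_enc = ys.flatten()[:, None] * freqs[None, :]
    x_enc = xs.flatten()[:, None] * freqs[None, :]
    return torch.cat([y_enc.sin(), y_enc.cos(), x_enc.sin(), x_enc.cos()], dim=1)


class RoutedPixelDecoder(nn.Module):
    """Routes a subset of encoder levels, downsamples each with a strided
    convolution, adds sine positional encodings, and fuses the flattened
    tokens with one self-attention layer (a dense stand-in for the paper's
    shallow deformable-attention pixel decoder)."""

    def __init__(self, in_channels=(96, 192, 384, 768), dim=128,
                 routed_levels=(0, 1)):
        # Indices 0 and 1 correspond to levels 1 and 2, the embedded default.
        super().__init__()
        self.routed_levels = routed_levels
        self.proj = nn.ModuleList(
            nn.Conv2d(in_channels[l], dim, kernel_size=3, stride=2, padding=1)
            for l in routed_levels)  # project + downsample by 2
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, features):
        """features: list of four tensors (B, C_l, H_l, W_l), finest first."""
        tokens = []
        for proj, l in zip(self.proj, self.routed_levels):
            x = proj(features[l])                 # (B, dim, H', W')
            b, d, h, w = x.shape
            x = x.flatten(2).transpose(1, 2)      # (B, H'*W', dim)
            x = x + sine_positional_encoding(h, w, d).to(x)
            tokens.append(x)
        t = torch.cat(tokens, dim=1)              # concat routed levels
        fused, _ = self.attn(t, t, t)             # one-layer fusion
        return self.norm(t + fused)               # (B, N_tokens, dim)


if __name__ == "__main__":
    # Toy feature hierarchy; real inputs would be much larger.
    feats = [torch.randn(1, c, s, s)
             for c, s in zip((96, 192, 384, 768), (64, 32, 16, 8))]
    print(RoutedPixelDecoder()(feats).shape)  # torch.Size([1, 1280, 128])
```

Note that deformable attention samples only a few points per query, so its cost grows roughly linearly with the number of tokens; the dense self-attention above is quadratic and used purely to keep the sketch short.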
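
The compute reduction factors annotated in Figure 2 are ratios of FLOP counts at matched input sizes. The paper does not specify its profiler; one common way to obtain such numbers is fvcore's FlopCountAnalysis, sketched here with a placeholder model:

```python
# Hypothetical GFLOPs-vs-input-size measurement (the placeholder Conv2d
# stands in for a full panoptic model; fvcore is one possible profiler).
import torch
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

model = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
for size in (512, 768, 1024):
    x = torch.randn(1, 3, size, size)
    gflops = FlopCountAnalysis(model, (x,)).total() / 1e9
    print(f"{size}x{size}: {gflops:.2f} GFLOPs")
```

The per-resolution reduction factor is then simply the baseline's GFLOPs divided by LiPS's at the same input size.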