
Mobile-Seed: Joint Semantic Segmentation and Boundary Detection for Mobile Robots

Youqi Liao, Shuhao Kang, Jianping Li, Yang Liu, Yun Liu, Zhen Dong, Bisheng Yang, Xieyuanli Chen

TL;DR

Mobile-Seed targets boundary-aware semantic perception on edge devices. It jointly learns semantic segmentation and boundary detection with a two-stream encoder and an Active Fusion Decoder (AFD) that fuses semantic and boundary features using input-conditioned, channel-wise weights; a dual-task regularization loss mitigates conflicts between the two training objectives. On Cityscapes, it improves over the SOTA baseline by 2.2 pp in mIoU and 4.2 pp in mF-score while running at 23.9 FPS on 1024$\times$2048 input with an RTX 2080 Ti GPU, and it generalizes to CamVid and PASCAL Context, pointing to practical use in semantic SLAM and edge robotics. The contributions are the lightweight two-stream architecture, the AFD, and the dual-task regularization loss, which together enable efficient, boundary-precise perception for mobile robots.

Abstract

Precise and rapid delineation of sharp boundaries and robust semantics is essential for numerous downstream robotic tasks, such as robot grasping and manipulation, real-time semantic mapping, and online sensor calibration performed on edge computing units. Although boundary detection and semantic segmentation are complementary tasks, most studies focus on lightweight models for semantic segmentation and overlook the critical role of boundary detection. In this work, we introduce Mobile-Seed, a lightweight, dual-task framework tailored for simultaneous semantic segmentation and boundary detection. Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach. The encoder is divided into two pathways: one captures category-aware semantic information, while the other discerns boundaries from multi-scale features. The AFD module dynamically adapts the fusion of semantic and boundary information by learning channel-wise relationships, allowing for precise weight assignment of each channel. Furthermore, we introduce a regularization loss to mitigate the conflicts in dual-task learning and deep diversity supervision. Compared to existing methods, the proposed Mobile-Seed offers a lightweight framework that simultaneously improves semantic segmentation performance and accurately locates object boundaries. Experiments on the Cityscapes dataset show that Mobile-Seed improves over the state-of-the-art (SOTA) baseline by 2.2 percentage points (pp) in mIoU and 4.2 pp in mF-score, while maintaining an online inference speed of 23.9 frames-per-second (FPS) with 1024$\times$2048 resolution input on an RTX 2080 Ti GPU. Additional experiments on the CamVid and PASCAL Context datasets confirm our method's generalizability. Code and additional results are publicly available at https://whu-usi3dv.github.io/Mobile-Seed/.
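
To make the idea of channel-wise, input-conditioned fusion concrete, the following is a minimal NumPy sketch. It is not the authors' AFD implementation: it assumes a squeeze-and-excitation-style gate, where each stream is global-average-pooled into a per-channel descriptor and a single linear layer followed by a sigmoid produces one weight per channel. The function name `channel_gated_fusion` and the parameters `w` and `bias` are hypothetical stand-ins for learned parameters.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_gated_fusion(f_s, f_b, w, bias):
    """Fuse semantic and boundary features with per-channel gates.

    f_s, f_b : arrays of shape (C, H, W), semantic and boundary features.
    w        : array of shape (2C, 2C), hypothetical learned weights.
    bias     : array of shape (2C,), hypothetical learned bias.
    Returns an array of shape (C, H, W).
    """
    c = f_s.shape[0]
    # Global average pooling: one scalar descriptor per channel of each stream.
    desc = np.concatenate([f_s.mean(axis=(1, 2)), f_b.mean(axis=(1, 2))])
    # The gates depend on the input through `desc`, so the fusion weights
    # change from image to image (assumed gating form, not from the paper).
    gates = sigmoid(w @ desc + bias)
    g_s, g_b = gates[:c], gates[c:]
    # Broadcast each channel's gate over its spatial map and sum the streams.
    return g_s[:, None, None] * f_s + g_b[:, None, None] * f_b


# Usage with random stand-in features and parameters.
rng = np.random.default_rng(0)
f_s = rng.normal(size=(8, 16, 16))
f_b = rng.normal(size=(8, 16, 16))
w = rng.normal(size=(16, 16))
bias = np.zeros(16)
fused = channel_gated_fusion(f_s, f_b, w, bias)
print(fused.shape)  # (8, 16, 16)
```

The point of the sketch is only that the fusion weights are computed from the features themselves rather than fixed, which is what distinguishes this kind of fusion from plain addition or concatenation.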

Paper Structure (13 sections, 13 equations, 9 figures, 6 tables)

This paper contains 13 sections, 13 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: (a) Motivation map: Mobile-Seed performs pixel-wise segmentation and object boundary detection simultaneously, and then fuses semantic and boundary features for accurate prediction. The boundary detection and semantic segmentation predictions can be passed to downstream tasks, e.g., robot manipulation, semantic mapping and sensor calibration. (b) Mobile-Seed achieves higher performance on both semantic segmentation and boundary detection tasks while keeping real-time efficiency. The input resolution is 1024$\times$2048 when testing inference speed. "AFF" and "Seg" denote AFFormer [dong2023head] and SegFormer [xie2021segformer], respectively.
  • Figure 2: Example of a color image (a), semantic mask (b), semantic boundary mask (c) and binary boundary mask (d). Semantic boundary masks are generated following [yu2017casenet, hu2019dynamic], and binary boundary masks are generated following [liu2022semantic].
  • Figure 3: Workflow of Mobile-Seed, where the semantic stream $\mathcal{S}$ and boundary stream $\mathcal{B}$ extract semantic and boundary features, respectively. AFD estimates the relative weights for each channel of semantic features $\boldsymbol{F}_s$ and boundary features $\boldsymbol{F}_b$. An auxiliary classification head is applied to the semantic stream for direct supervision during training. Semantic prediction $\boldsymbol{s}$, fused semantic prediction $\boldsymbol{s}_f$, and boundary prediction $\boldsymbol{b}$ are each supervised separately. Regularization loss $\mathcal{L}_{reg}$ mitigates the divergences caused by dual-task learning.
  • Figure 4: Examples of boundary maps from the boundary stream. The first column shows the input, the second shows the boundary predictions, and the last column shows the ground-truth boundaries.
  • Figure 5: Illustration of the proposed AFD.
  • ...and 4 more figures
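
As a rough illustration of how a binary boundary mask (Figure 2d) relates to a semantic mask (Figure 2b), the sketch below marks a pixel as boundary whenever one of its 4-connected neighbors carries a different class label. This is an assumed, simplified rule for illustration only; the paper generates its masks following the cited works, whose exact procedure (for example, boundary width or ignored labels) is not described here. The function name `binary_boundary` is hypothetical.

```python
import numpy as np


def binary_boundary(labels):
    """Return a boolean mask that is True where a pixel touches another class.

    labels : 2D integer array of per-pixel class ids.
    """
    labels = np.asarray(labels)
    boundary = np.zeros(labels.shape, dtype=bool)
    # Compare each pixel with the one below it; mark both sides of a change.
    diff_v = labels[1:, :] != labels[:-1, :]
    boundary[1:, :] |= diff_v
    boundary[:-1, :] |= diff_v
    # Compare each pixel with the one to its right; mark both sides.
    diff_h = labels[:, 1:] != labels[:, :-1]
    boundary[:, 1:] |= diff_h
    boundary[:, :-1] |= diff_h
    return boundary


# Usage: two classes split down the middle of a 2x4 label map.
mask = binary_boundary([[0, 0, 1, 1],
                        [0, 0, 1, 1]])
print(mask.astype(int))
# [[0 1 1 0]
#  [0 1 1 0]]
```

Marking both sides of each label change yields a boundary two pixels wide; a one-sided variant would mark only one of the two comparisons.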