
LiDAR-Camera Fusion for Video Panoptic Segmentation without Video Training

Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi, Mohammad Rahmati

TL;DR

This work targets high-quality image and video panoptic segmentation (PS and VPS) without relying on video training data. It introduces a LiDAR-image feature fusion module and two lightweight transformer-query augmentations, Location-Aware Queries (LAQ) and Time-Aware Queries (TAQ), to improve both PS and VPS, with depth information from LiDAR or stereo aiding the segmentation. The proposed fusion, together with LAQ (and, to a lesser extent in some settings, TAQ), yields improvements of up to 5 points in panoptic quality and substantial gains in VPQ over a video-free baseline, bringing performance closer to video-supervised methods. The approach is practical for autonomous driving, where video training data may be scarce, because it uses depth cues and query design to improve scene understanding without video-specific training.

Abstract

Panoptic segmentation, which combines instance and semantic segmentation, has gained considerable attention in autonomous vehicles because of its comprehensive representation of the scene. The task can be applied to both camera and LiDAR data, but there has been limited focus on combining the two sensors to enhance image panoptic segmentation (PS). Although previous research has acknowledged the benefit of 3D data for camera-based scene perception, no study has specifically explored the influence of 3D data on image and video panoptic segmentation (VPS). This work introduces a feature fusion module that enhances PS and VPS by fusing LiDAR and image data for autonomous vehicles. We also show that, in addition to this fusion, our proposed model, with two simple modifications, delivers higher-quality VPS without being trained on video data. The results demonstrate a substantial improvement of up to 5 points in both the image and video panoptic segmentation evaluation metrics.
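
For readers unfamiliar with the image metric behind the reported gains, the following is a minimal sketch of panoptic quality (PQ) in its standard form: the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment match if their IoU exceeds 0.5. It is not the authors' evaluation code. The function name `panoptic_quality`, the single-class simplification, and the convention that id 0 means "no segment" are assumptions made for illustration; benchmark implementations compute PQ per class and average over classes.

```python
import numpy as np

def panoptic_quality(pred, gt, iou_threshold=0.5):
    """PQ for one class. pred and gt are 2D integer id maps; id 0 is ignored (assumption)."""
    pred_ids = [i for i in np.unique(pred) if i != 0]
    gt_ids = [i for i in np.unique(gt) if i != 0]
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        for p in pred_ids:
            if p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            if inter == 0:
                continue
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            # With a threshold of 0.5 and non-overlapping segments,
            # each ground-truth segment has at most one match.
            if iou > iou_threshold:
                tp += 1
                iou_sum += iou
                matched_pred.add(p)
                break
    fp = len(pred_ids) - tp  # predicted segments with no match
    fn = len(gt_ids) - tp    # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0

# Toy example: two ground-truth segments, one predicted perfectly,
# the other only half covered (IoU exactly 0.5, so not a match).
gt_map = np.zeros((4, 4), dtype=int)
gt_map[:, :2] = 1
gt_map[:, 2:] = 2
pred_map = np.zeros((4, 4), dtype=int)
pred_map[:, :2] = 1
pred_map[:, 3] = 2
pq_toy = panoptic_quality(pred_map, gt_map)
```

In the toy example one segment is a true positive with IoU 1, while the half-covered segment counts as both a false positive and a false negative, so PQ is 1 / (1 + 0.5 + 0.5) = 0.5.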

Paper Structure (19 sections, 2 equations, 3 figures, 2 tables)

This paper contains 19 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (a) Example image from the Cityscapes dataset. (b) Semi-sparse LiDAR points simulated from estimated disparity xu2023unifying. (c) Upsampled LiDAR points using ku2018defense. (d) Output of the panoptic segmentation model.
  • Figure 2: Overall architecture of our proposed method, which is based on Mask2Former cheng2022masked. The parts with a yellow background are our contributions.
  • Figure 3: Panoptic segmentation output for a video sequence. The base model (left) has significantly more ID switches compared to our proposed method (right). Some ID switches are denoted with bounding boxes.
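
The step in Figure 1(b), simulating LiDAR-like points from disparity, can be illustrated with the standard stereo relation depth = focal length × baseline / disparity. The sketch below is only an illustration under stated assumptions: the function name `disparity_to_sparse_depth`, its parameters, and the choice of keeping every `row_step`-th image row to mimic scan lines are hypothetical, and the paper's actual sampling procedure may differ.

```python
import numpy as np

def disparity_to_sparse_depth(disparity, focal_px, baseline_m, row_step=4):
    """Convert a disparity map (pixels) to metric depth, then keep only
    every row_step-th row as a crude stand-in for LiDAR scan lines (assumption)."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0  # zero or negative disparity carries no depth
    depth[valid] = focal_px * baseline_m / disparity[valid]
    sparse = np.zeros_like(depth)  # 0 marks pixels without a simulated point
    sparse[::row_step] = depth[::row_step]
    return sparse

# Toy example with made-up camera parameters.
disp = np.full((8, 4), 2.0)
disp[0, 0] = 0.0  # one invalid pixel
sparse_depth = disparity_to_sparse_depth(disp, focal_px=100.0, baseline_m=0.5, row_step=4)
```

A sparse map of this kind would then be densified, as in Figure 1(c), before being fused with image features.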