MaskBEV: Joint Object Detection and Footprint Completion for Bird's-eye View 3D Point Clouds

William Guimont-Martin, Jean-Michel Fortin, François Pomerleau, Philippe Giguère

TL;DR

The paper tackles the challenges of 3D object detection in LiDAR point clouds by moving away from bounding boxes toward BEV instance masks that encode full object footprints. MaskBEV employs a PointPillars-like encoder to produce BEV features and a Mask2Former-inspired mask predictor to output a set of binary BEV masks with class labels, eliminating regression and NMS-heavy post-processing. To train such a model, the authors generate complete BEV masks by aggregating multi-scan data and applying morphological operations, enabling learned object completion even under occlusion. Evaluation on SemanticKITTI and KITTI demonstrates the approach's viability, showing robust footprint completion and competitive mask-based metrics, while also highlighting limitations tied to dataset size and scene complexity. This work opens a new path for mask-based 3D detection in LiDAR data and suggests benefits from larger datasets and broader object categories.

Abstract

Recent work on object detection in LiDAR point clouds mostly focuses on predicting bounding boxes around objects. This prediction is commonly achieved using anchor-based or anchor-free detectors, which require significant explicit prior knowledge about the objects to work properly. To remedy these limitations, we propose MaskBEV, a bird's-eye view (BEV) mask-based neural architecture for object detection. MaskBEV predicts a set of BEV instance masks that represent the footprints of detected objects. Moreover, our approach performs object detection and footprint completion in a single pass. MaskBEV also reformulates the detection problem purely in terms of classification, doing away with the regression usually used to predict bounding boxes. We evaluate the performance of MaskBEV on both the SemanticKITTI and KITTI datasets and analyze the architecture's advantages and limitations.
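
To make the "classification only" formulation more concrete, here is a minimal sketch of how a set of per-query outputs could be turned into detections using only a sigmoid, a threshold, and a class argmax, with no box regression or NMS. It is not the authors' code. The function name `decode_masks`, the array shapes, the trailing "no object" class column (a convention borrowed from Mask2Former-style predictors), and both thresholds are assumptions made for illustration.

```python
import numpy as np

def decode_masks(mask_logits, class_logits, mask_threshold=0.5, score_threshold=0.5):
    """Turn per-query outputs into (label, score, binary_mask) detections.

    mask_logits:  (Q, H, W) raw BEV mask predictions, one per query (assumed shape).
    class_logits: (Q, C + 1) class scores; the last column is assumed to be "no object".
    """
    # Numerically stable softmax over the class dimension.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    labels = probs.argmax(axis=1)
    scores = probs[np.arange(len(labels)), labels]
    no_object = class_logits.shape[1] - 1

    # Per-pixel sigmoid followed by thresholding gives one binary mask per query.
    masks = 1.0 / (1.0 + np.exp(-mask_logits)) > mask_threshold

    detections = []
    for label, score, mask in zip(labels, scores, masks):
        # Keep only queries that confidently predict a real class.
        if label != no_object and score > score_threshold:
            detections.append((int(label), float(score), mask))
    return detections
```

Each kept query contributes exactly one instance mask, so no overlapping candidates need to be merged or suppressed afterwards.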

Paper Structure

This paper contains 18 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Mask prediction from MaskBEV. (a) BEV of a point cloud from SemanticKITTI's validation set. We overlay the ground truth masks capturing each object's footprint on the point cloud. (b) Mask predictions from MaskBEV overlaid on the input point cloud. Each instance is predicted on a different mask, shown here in different colors. (c-h) Mask predictions before applying the sigmoid and thresholding operations. We clipped and normalized the raw predictions to make them easier to visualize. Note the black outlines of cars that are not predicted by a particular mask: each query token specializes in detecting one instance while suppressing the others.
  • Figure 2: Mask generation from instance label. (a) Single scan point clouds only show the surface of the object that directly faces the LiDAR. (b) The mask generated from a single scan is partial and does not represent the complete footprint of the vehicle. (c) Using merged sequential LiDAR scans, it is possible to gather points from all around a vehicle. (d) Masks produced from the constructed map are complete, i.e., represent the entire footprint of the vehicle.
  • Figure 3: MaskBEV complete architecture. It has two main parts: an encoder and a mask prediction network. The encoder is responsible for converting a 3D point cloud into a BEV feature map. This feature map is then fed into a Mask2Former [Cheng2022] network that outputs a set of class predictions and binary BEV masks. Each class and mask pair represents a detection made by the network. These masks predict the footprint of each detected instance.
  • Figure 4: (a) Histogram of the ratio of the area of the largest mask generated from a single scan, $A_{single}$, to the area of the complete mask of the same instance, $A_{complete}$, for SemanticKITTI's validation split. Lower ratios indicate instances that are not well captured by any single scan. (b) Histogram of the ratio of the area of our prediction, $A_{pred}$, to the area of the complete mask, $A_{complete}$, of the same instance, for SemanticKITTI's validation split. Most predictions are larger than their corresponding ground truth (i.e., ratios greater than one), meaning that MaskBEV tends to overestimate the footprint of instances.
  • Figure 5: MaskBEV predictions on SemanticKITTI and KITTI datasets. The first three columns are samples from SemanticKITTI, and the rightmost one is from KITTI. The top row shows a sample of good predictions from the network. The bottom row displays failure cases that illustrate the limitations of MaskBEV. The network struggles with more complex scenes such as (e), (f) and (g). Missed detections are marked by red arrows. Smaller ground truths (i.e., rectangles that are too small to be vehicles) are filtered out by the process described in the mask generation section, and thus are not used for evaluation.
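
The mask generation idea in Figure 2 and the area ratio $A_{single} / A_{complete}$ in Figure 4(a) can be illustrated with a small sketch. The captions only say that masks come from merged sequential scans and, per the TL;DR, morphological operations. The grid resolution, the choice of a morphological closing with a 3x3 structuring element, and all function names below (`rasterize`, `close_mask`, `area_ratio`) are therefore assumptions for illustration, not the authors' pipeline.

```python
import numpy as np

def rasterize(points_xy, x_range, y_range, cell):
    """Mark every BEV grid cell that contains at least one point of the instance."""
    width = int(round((x_range[1] - x_range[0]) / cell))
    height = int(round((y_range[1] - y_range[0]) / cell))
    mask = np.zeros((height, width), dtype=bool)
    cols = np.floor((points_xy[:, 0] - x_range[0]) / cell).astype(int)
    rows = np.floor((points_xy[:, 1] - y_range[0]) / cell).astype(int)
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    mask[rows[inside], cols[inside]] = True
    return mask

def _dilate(mask):
    """Binary dilation with a 3x3 structuring element (assumed size)."""
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def _erode(mask):
    """Binary erosion, written as the complement of dilating the complement."""
    return ~_dilate(~mask)

def close_mask(mask):
    """Morphological closing (dilate, then erode) to fill small gaps between points."""
    return _erode(_dilate(mask))

def area_ratio(mask_a, mask_b):
    """Ratio of occupied areas; the cell area cancels when both masks share a grid."""
    area_b = int(mask_b.sum())
    if area_b == 0:
        raise ValueError("reference mask is empty")
    return int(mask_a.sum()) / area_b
```

In this sketch, the points of one instance accumulated over several scans would be passed to `rasterize` and `close_mask` to obtain a complete mask, while the points of a single scan give a partial mask. `area_ratio(single, complete)` then plays the role of $A_{single} / A_{complete}$, and the same function applied to a predicted mask gives $A_{pred} / A_{complete}$.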