MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection

Yuxue Yang; Lue Fan; Zhaoxiang Zhang

TL;DR

MixSup tackles label efficiency in LiDAR-based 3D detection by combining abundant coarse cluster-level semantic labels with a limited number of accurate box-level labels to jointly learn semantics and geometry. It redesigns label assignment to be detector-friendly, enabling easy integration with mainstream detectors, and introduces PointSAM to automate coarse labeling via SAM, further reducing annotation burden. Across nuScenes, Waymo, and KITTI, MixSup attains up to approximately 97% of fully supervised performance using only 10% box annotations plus cheap cluster labels, demonstrating strong practical efficiency. The approach is compatible with simple self-training and can be extended with auto-labelers, offering a scalable, versatile path toward cost-effective LiDAR perception without substantial accuracy loss.

Abstract

Label-efficient LiDAR-based 3D object detection is currently dominated by weakly/semi-supervised methods. Instead of exclusively following one of them, we propose MixSup, a more practical paradigm that simultaneously utilizes massive cheap coarse labels and a limited number of accurate labels for Mixed-grained Supervision. We start by observing that point clouds are usually textureless, making it hard to learn semantics. However, point clouds are geometrically rich and scale-invariant with respect to the distance from the sensor, making it relatively easy to learn the geometry of objects, such as poses and shapes. Thus, MixSup leverages massive coarse cluster-level labels to learn semantics and a few expensive box-level labels to learn accurate poses and shapes. We redesign the label assignment in mainstream detectors, which allows them to be seamlessly integrated into MixSup, making the paradigm practical and universal. We validate its effectiveness on nuScenes, Waymo Open Dataset, and KITTI, employing various detectors. MixSup achieves up to 97.31% of fully supervised performance, using cheap cluster annotations and only 10% box annotations. Furthermore, we propose PointSAM, based on the Segment Anything Model, for automated coarse labeling, further reducing the annotation burden. The code is available at https://github.com/BraveGroup/PointSAM-for-MixSup.
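The split between semantic and geometric supervision described in the abstract can be illustrated with a toy loss. The sketch below is not the authors' implementation and does not reproduce the redesigned label assignment. It only assumes that every sample carries a cluster-level class label while a small subset also carries a box label. The function name `mixed_grained_loss` and the dictionary keys are hypothetical, chosen for illustration.

```python
import math


def mixed_grained_loss(samples):
    """Toy mixed-grained loss (illustrative only, not the MixSup code).

    Each sample is a dict with hypothetical keys:
      class_probs   - predicted class probabilities
      cluster_label - index of the coarse cluster-level class label
      box_pred      - predicted box parameters
      box_label     - box parameters, or None when only a coarse label exists
    """
    cls_terms, reg_terms = [], []
    for s in samples:
        # Semantic term: cross-entropy against the cluster-level class label.
        # Every sample contributes, since coarse labels are abundant.
        cls_terms.append(-math.log(s["class_probs"][s["cluster_label"]]))

        # Geometry term: mean L1 error on box parameters.
        # Only the few samples that carry a box label contribute.
        if s.get("box_label") is not None:
            diffs = [abs(p - t) for p, t in zip(s["box_pred"], s["box_label"])]
            reg_terms.append(sum(diffs) / len(diffs))

    cls_loss = sum(cls_terms) / len(cls_terms)
    reg_loss = sum(reg_terms) / len(reg_terms) if reg_terms else 0.0
    return cls_loss, reg_loss


samples = [
    # Coarse label only: supervises classification.
    {"class_probs": [0.5, 0.5], "cluster_label": 0,
     "box_pred": [0.0, 0.0, 0.0], "box_label": None},
    # Coarse label plus box label: supervises classification and geometry.
    {"class_probs": [0.25, 0.75], "cluster_label": 1,
     "box_pred": [1.0, 2.0, 3.0], "box_label": [1.0, 2.5, 3.5]},
]
cls_loss, reg_loss = mixed_grained_loss(samples)
```

In this toy setup the classification term averages over both samples, while the regression term is computed from the single box-labelled sample. That mirrors the idea of learning semantics from many cheap labels and geometry from a few accurate ones.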

Paper Structure (49 sections, 3 equations, 9 figures, 15 tables)

Introduction
Related Work
LiDAR-based 3D Object Detection
Semi-supervised Learning in 3D
Weakly Supervised Learning
Pilot Study: What Really Matters for Label Efficiency
Method
Cluster-level Coarse Label
Coarse Label Assignment
Center-based Assignment and Inconsistency Removal
Box-based Assignment
Ambiguity of Box-based Assignment
PointSAM for Coarse Label Generation
SAM-assisted 3D Instance Segmentation
Separability-Aware Refinement (SAR)
...and 34 more sections

Figures (9)

Figure 1: Illustration of distinct properties of point clouds compared to images. They make semantic learning from points difficult but ease the estimation of geometry, which is the initial motivation of MixSup.
Figure 2: Illustration of the pilot study. We develop a well-classified dataset to factor out the classification and only focus on the influence of varying data amounts on geometry estimation.
Figure 3: Overview of MixSup. The massive cluster-level labels serve for semantic learning and a few box labels are used to learn geometry attributes. We redesign the label assignment to integrate various detectors into MixSup.
Figure 4: Illustration of Box-cluster IoU.
Figure 5: Overview of PointSAM.
...and 4 more figures
