TSceneJAL: Joint Active Learning of Traffic Scenes for 3D Object Detection
Chenyang Lei, Weiyuan Peng, Guang Zhou, Meiying Zhang, Qi Hao, Chunlin Ji, Chengzhong Xu
TL;DR
TSceneJAL addresses the costly labeling bottleneck in 3D object detection for autonomous driving by proposing a three-stage active learning framework that jointly optimizes category balance, scene diversity, and data complexity. It leverages a category-entropy-based first stage, a directed-graph scene-similarity second stage with a marginalized kernel for diversity, and a mixture-density-network-based uncertainty third stage to target complex scenes, all integrated into a multi-stage sampling loop. Across KITTI, Lyft, nuScenes, and SUScape, TSceneJAL achieves consistent improvements in $mAP_{3D}$ and $mAP_{BEV}$ with reduced annotation effort, sometimes surpassing fully supervised baselines as data scales. The approach combines MDN-based regression uncertainty with graph-based scene representations to deliver data-efficient learning, and it shows practical potential for reducing labeling costs in real-world AD pipelines.
Abstract
Most autonomous driving (AD) datasets incur substantial collection and labeling costs, yet inevitably contain many low-quality and redundant data instances, which compromises both performance and efficiency. Many applications in AD systems require high-quality training datasets built from both existing datasets and newly collected data. In this paper, we propose a traffic scene joint active learning (TSceneJAL) framework that can efficiently sample balanced, diverse, and complex traffic scenes from both labeled and unlabeled data. The novelty of this framework is threefold: 1) a scene sampling scheme based on category entropy, which identifies scenes containing multiple object classes and thus mitigates class imbalance for the active learner; 2) a similarity sampling scheme, estimated through a directed graph representation and a marginalized kernel algorithm, which picks sparse and diverse scenes; 3) an uncertainty sampling scheme, predicted by a mixture density network, which selects instances with the most unclear or complex regression outcomes for the learner. Finally, the integration of these three schemes in a joint selection strategy yields an optimal and valuable subdataset. Experiments on the KITTI, Lyft, nuScenes, and SUScape datasets demonstrate that our approach outperforms existing state-of-the-art methods on 3D object detection tasks, with improvements of up to 12%.
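To make the first scheme concrete, the following is a minimal sketch of category-entropy scene ranking. It assumes that category entropy is the Shannon entropy of the per-scene class frequencies, which may differ from the exact definition used in the paper. The function names `category_entropy` and `top_k_by_entropy`, the scene ids, and the toy labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def category_entropy(labels):
    """Shannon entropy (in nats) of the object-class distribution in one scene.

    Assumption: `labels` is a list of class names, one per object in the scene
    (ground-truth labels, or detector predictions for unlabeled scenes).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty scene carries no class information
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def top_k_by_entropy(scenes, k):
    """Return the ids of the k scenes with the highest category entropy."""
    ranked = sorted(scenes, key=lambda sid: category_entropy(scenes[sid]), reverse=True)
    return ranked[:k]


# Toy data (hypothetical): scene id -> object classes present in the scene.
scenes = {
    "scene_a": ["Car", "Car", "Car", "Car"],
    "scene_b": ["Car", "Pedestrian", "Cyclist"],
    "scene_c": ["Car", "Car", "Pedestrian"],
}
selected = top_k_by_entropy(scenes, 2)
print(selected)
```

In this toy example the single-class scene scores lowest and the scene with three equally represented classes scores highest, which matches the stated goal of favoring scenes that contain multiple object classes. The similarity and uncertainty schemes are not covered by this sketch.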
