TSceneJAL: Joint Active Learning of Traffic Scenes for 3D Object Detection

Chenyang Lei, Weiyuan Peng, Guang Zhou, Meiying Zhang, Qi Hao, Chunlin Ji, Chengzhong Xu

TL;DR

TSceneJAL addresses the costly labeling bottleneck in 3D object detection for autonomous driving by proposing a three-stage active learning framework that jointly optimizes category balance, scene diversity, and data complexity. It leverages a category-entropy-based first stage, a directed-graph scene-similarity second stage with a marginalized kernel for diversity, and a mixture-density-network-based uncertainty third stage to target complex scenes, all integrated into a multi-stage sampling loop. Across KITTI, Lyft, nuScenes, and SUScape, TSceneJAL achieves consistent improvements in $mAP_{3D}$ and $mAP_{BEV}$ with reduced annotation effort, sometimes surpassing fully supervised baselines as data scales. The approach combines MDN-based regression uncertainty with graph-based scene representations to deliver data-efficient learning, and it shows practical potential for reducing labeling costs in real-world AD pipelines.
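As a rough illustration of the category-entropy stage, the sketch below scores a scene by the Shannon entropy of its predicted class labels and ranks scenes by that score. This summary does not give the paper's exact formula, so the entropy definition, the function names `category_entropy` and `top_k_by_entropy`, and the toy scene dictionary are all assumptions made for illustration.

```python
import math
from collections import Counter


def category_entropy(labels):
    """Shannon entropy (in nats) of the class distribution within one scene.

    `labels` is a list of predicted class names for the boxes in the scene.
    An empty scene, or a scene with a single class, scores 0.
    """
    if not labels:
        return 0.0
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def top_k_by_entropy(scenes, k):
    """Return the ids of the k scenes with the highest category entropy.

    `scenes` maps a scene id to its list of predicted class names
    (a hypothetical data layout, not the paper's).
    """
    ranked = sorted(scenes, key=lambda sid: category_entropy(scenes[sid]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    toy_scenes = {
        "a": ["Car", "Car"],
        "b": ["Car", "Pedestrian", "Cyclist"],
        "c": ["Car", "Pedestrian"],
    }
    print(top_k_by_entropy(toy_scenes, 2))
```

Under this definition, a scene with several classes in roughly equal proportion scores higher than a scene dominated by one class, which is the property the first stage relies on to counter class imbalance.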

Abstract

Most autonomous driving (AD) datasets incur substantial costs for collection and labeling, inevitably yielding a plethora of low-quality and redundant data instances, thereby compromising performance and efficiency. Many applications in AD systems necessitate high-quality training datasets built from both existing datasets and newly collected data. In this paper, we propose a traffic scene joint active learning (TSceneJAL) framework that can efficiently sample balanced, diverse, and complex traffic scenes from both labeled and unlabeled data. The novelty of this framework is threefold: 1) a scene sampling scheme based on category entropy, to identify scenes containing multiple object classes, thus mitigating class imbalance for the active learner; 2) a similarity sampling scheme, estimated through a directed graph representation and a marginalized kernel algorithm, to pick sparse and diverse scenes; 3) an uncertainty sampling scheme, predicted by a mixture density network, to select instances with the most unclear or complex regression outcomes for the learner. Finally, the integration of these three schemes in a joint selection strategy yields an optimal and valuable subdataset. Experiments on the KITTI, Lyft, nuScenes and SUScape datasets demonstrate that our approach outperforms existing state-of-the-art methods on 3D object detection tasks, with improvements of up to 12%.
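To make the mixture-density-network uncertainty more concrete, the sketch below applies the law of total variance to a one-dimensional Gaussian mixture, splitting the predictive variance into a within-component term and a between-component term. MDN-based work often reads these as aleatoric and epistemic uncertainty respectively; whether TSceneJAL uses exactly this decomposition is not stated in this summary, so treat the function `mixture_uncertainty` and its argument layout as assumptions.

```python
def mixture_uncertainty(weights, means, variances):
    """Decompose the variance of a 1-D Gaussian mixture.

    Returns (mean, within, between) where
      mean    = sum_k pi_k * mu_k
      within  = sum_k pi_k * sigma_k^2          (average component variance)
      between = sum_k pi_k * (mu_k - mean)^2    (spread of component means)
    and within + between is the total predictive variance.
    Weights are renormalized so they sum to 1.
    """
    if not (len(weights) == len(means) == len(variances)):
        raise ValueError("weights, means and variances must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("mixture weights must sum to a positive value")

    pi = [w / total_weight for w in weights]
    mean = sum(p * m for p, m in zip(pi, means))
    within = sum(p * v for p, v in zip(pi, variances))
    between = sum(p * (m - mean) ** 2 for p, m in zip(pi, means))
    return mean, within, between


if __name__ == "__main__":
    # Two equally weighted components that disagree about the regression target.
    print(mixture_uncertainty([0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))
```

When the components agree on the mean, the between-component term vanishes and only the average component variance remains; disagreement between components raises the total, which is what would flag a scene as hard to regress.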

Paper Structure

This paper contains 41 sections, 23 equations, 14 figures, 10 tables, 2 algorithms.

Figures (14)

  • Figure 1: A brief illustration of the proposed AL framework. The AL predictor generates pseudo-labels from the unlabeled data. The AL sampler employs a joint sampling strategy to select the most informative scenes. The oracle furnishes accurate labels for the sampled scenes, updating the labeled dataset, which is then used to retrain the AL predictor. This process iterates until the desired number of sampled scenes is attained.
  • Figure 2: An illustration of our TSceneJAL framework. The AL predictor $f_m$ is transformed by integrating an MDN and pre-trained using the initial labeled dataset $\mathcal{D}_i$. The trained AL predictor $f_m$ generates pseudo-labels and MDN predictions, $(\hat{\mathcal{C}},\hat{\mathcal{M}})$, from the unlabeled dataset $\mathcal{D}_u$. The AL sampler incorporates three metrics, namely category entropy (\ref{sec:cate}), scene similarity (\ref{sec:similarity}) and perception uncertainty (\ref{sec:uncertainty}), to evaluate the data. A three-stage hybrid sampling strategy (\ref{sec:selection}) is then used for data selection, resulting in a set of $N_r$ required scenes. These selected scenes are annotated by the oracle $\Omega$ and subsequently used for model retraining.
  • Figure 3: An illustration of the graph representation of scenes and scene similarity estimation. The process comprises several steps: 1) Information extraction from point clouds using the AL predictor to obtain predicted target boxes and labels. 2) Construction of a fully connected undirected graph structure with box categories as nodes and distances as edges, including the addition of a node for the ego vehicle. 3) Transformation of the undirected graph into a directed graph, updating edge weights, and introducing mirrored nodes. 4) Utilization of random walks to generate multiple subgraphs. 5) Calculation of similarity between graphs using Eq. (\ref{eq:simi_frame}).
  • Figure 4: An illustration of the MDN and the corresponding uncertainty propagation. (a) The MDN's structure, which enhances the regression head of the PointPillar model by modeling the output as a mixture of multiple Gaussian distributions, while the classification head remains unaltered. (b) Because the model is anchor-based, it predicts the residual with respect to the anchor; thus, the uncertainty obtained by the MDN reflects the uncertainty associated with the residual.
  • Figure 5: Experiment results of different AL methods with an increasing number of sampled scenes on the KITTI val set. Initially, all methods use the same set of 200 scenes. Over five iterations, 32% of the whole data (1200 scenes) is selected. Notably, our method outperforms all other methods in the final two iterations in terms of $mAP_{3D}$.
  • ...and 9 more figures
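
The iterative loop described in the Figure 1 caption, with the budget reported in the Figure 5 caption (200 initial scenes, five iterations, 1200 scenes in total, hence 200 new scenes per iteration), can be sketched as follows. This is a minimal sketch, not the authors' implementation: `score_fn` is a hypothetical stand-in for the joint three-stage sampler, retraining and pseudo-labeling are reduced to a comment, and the pool size of 2000 in the usage example is arbitrary.

```python
def active_learning_loop(pool_ids, initial_ids, score_fn, per_round, target_size):
    """Grow a labeled set round by round until it reaches `target_size`.

    pool_ids    : ids of all available scenes
    initial_ids : ids labeled before the first round
    score_fn    : hypothetical informativeness score; higher is sampled first
    per_round   : number of scenes sent to the oracle in each round
    Returns (labeled_ids, number_of_rounds).
    """
    labeled = list(initial_ids)
    already_labeled = set(initial_ids)
    unlabeled = [i for i in pool_ids if i not in already_labeled]
    rounds = 0

    while len(labeled) < target_size and unlabeled:
        # In a real pipeline the predictor would be retrained on `labeled`
        # here and would pseudo-label `unlabeled` before scoring.
        budget = min(per_round, target_size - len(labeled))
        ranked = sorted(unlabeled, key=score_fn, reverse=True)
        picked = ranked[:budget]

        # The oracle annotates the picked scenes; they join the labeled set.
        labeled.extend(picked)
        picked_set = set(picked)
        unlabeled = [i for i in unlabeled if i not in picked_set]
        rounds += 1

    return labeled, rounds


if __name__ == "__main__":
    pool = list(range(2000))          # arbitrary pool size for illustration
    initial = pool[:200]              # same 200 starting scenes for all methods
    labeled, rounds = active_learning_loop(
        pool, initial, score_fn=lambda i: -i, per_round=200, target_size=1200
    )
    print(len(labeled), rounds)
```

With these settings the loop stops after five rounds with 1200 labeled scenes, matching the budget in the Figure 5 caption.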