Language-Driven Active Learning for Diverse Open-Set 3D Object Detection

Ross Greer; Bjørk Antoniussen; Andreas Møgelmose; Mohan Trivedi

TL;DR

The paper tackles open-set 3D object detection in autonomous driving by addressing data imbalance and novel object diversity. It introduces VisLED-Querying, a language-driven active learning framework that uses vision-language embeddings to measure novelty and perform diversity-based sample selection in two modes: Open-World Exploring and Closed-World Mining. Evaluated on nuScenes with BEVFusion, VisLED consistently surpasses random sampling and approaches entropy-based querying without model-specific optimization, demonstrating data-efficient gains and improved detection of underrepresented classes. These results suggest VisLED can reduce annotation costs and enhance safety-critical perception, with potential for extension to multi-task open-set learning and other datasets.

Abstract

Object detection is crucial for ensuring safe autonomous driving. However, data-driven approaches face challenges when encountering minority or novel objects in the 3D driving scene. In this paper, we propose VisLED, a language-driven active learning framework for diverse open-set 3D object detection. Our method leverages active learning techniques to query diverse and informative data samples from an unlabeled pool, enhancing the model's ability to detect underrepresented or novel objects. Specifically, we introduce the Vision-Language Embedding Diversity Querying (VisLED-Querying) algorithm, which operates in both open-world exploring and closed-world mining settings. In open-world exploring, VisLED-Querying selects the data points that are most novel relative to existing data, while in closed-world mining, it mines novel instances of known classes. We evaluate our approach on the nuScenes dataset and compare its data efficiency against random sampling and entropy-querying methods. Our results show that VisLED-Querying consistently outperforms random sampling and offers competitive performance compared to entropy-querying, even though the latter is optimized for the specific model being trained, highlighting the potential of VisLED for improving object detection in autonomous driving scenarios. We make our code publicly available at https://github.com/Bjork-crypto/VisLED-Querying
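To make the open-world exploring idea concrete, the following is a minimal sketch of selecting the samples "most novel relative to existing data" in an embedding space. It is not the authors' implementation: it swaps in greedy farthest-point (k-center style) selection under cosine distance as a stand-in for the clustering step, and it assumes the vision-language embeddings have already been computed and are passed in as plain arrays. The function names `normalize` and `select_novel` are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def select_novel(unlabeled, labeled, budget):
    """Pick `budget` unlabeled indices by greedy farthest-point selection.

    Assumption: `unlabeled` and `labeled` are non-empty 2D arrays of
    precomputed embeddings (one row per scene). This is an illustrative
    stand-in, not the VisLED-Querying algorithm itself.
    """
    u = normalize(np.asarray(unlabeled, dtype=float))
    l = normalize(np.asarray(labeled, dtype=float))
    # Novelty score: cosine distance to the nearest already-labeled sample.
    min_dist = (1.0 - u @ l.T).min(axis=1)
    chosen = []
    for _ in range(min(budget, len(u))):
        idx = int(np.argmax(min_dist))
        chosen.append(idx)
        # Treat the new pick as labeled: shrink distances that it now covers.
        min_dist = np.minimum(min_dist, 1.0 - u @ u[idx])
        # Exclude the picked index from all later rounds.
        min_dist[idx] = -np.inf
    return chosen

if __name__ == "__main__":
    labeled = np.array([[1.0, 0.0]])
    unlabeled = np.array([[1.0, 0.01], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]])
    print(select_novel(unlabeled, labeled, budget=2))
```

Updating the distances after each pick keeps the query from spending its whole budget on near-duplicates of a single novel scene, which is the diversity property the abstract emphasizes.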

Paper Structure (10 sections, 6 figures, 1 table, 2 algorithms)

Introduction
Related Research
The Role of Uncertainty and Diversity-Based Methods in Closed and Open Set Learning
Learning from Vision-Language Representations
Algorithm
Experimental Evaluation
Dataset
3D Object Detection Model
Experiments and Results
Discussion and Conclusion

Figures (6)

Figure 1: Choosing the most informative data can impact object detection model performance. Images in the left column are the results of a model trained on 50% of nuScenes data, selected at random. Images in the right column are the results on the same images of a model trained on 50% of nuScenes data, but selected using our VisLED active learning query strategy. In the top two rows, we see cases where challenging pedestrians are missed on the left image (preparing to cross on the right side of the road, and standing behind the crossing pole, respectively), but correctly detected on the right. Similarly, in the bottom two rows, the under-represented classes of motorcycle and truck are more readily detected using our active learning strategy.
Figure 2: VisLED System Overview. For both Open-World Exploring and Closed-World Mining, the system begins with the processing of the unlabeled data pool into vision-language embedding representations. In Open-World Exploring, these embeddings are clustered and used as the basis for a query. In Closed-World Mining, the embeddings are first used in zero-shot learning to classify scenes based on object appearance, and then further clustered per-class, offering a chance to sample from particular classes which are known to be minority in the labeled training set.
Figure 3: BEVFusion models are trained using three different data selections: Random (dot markers and solid line), VisLED-Closed-World (x-markers and dashed line), and VisLED-Open-World (+-markers and dotted line). The top graph illustrates detection performance, while the bottom graph illustrates performance difference relative to the random-selection baseline. Performance is averaged over 5 complete data selection + training runs of each model at each training pool size.
Figure 4: The bicycle and motorcycle classes are least represented in the nuScenes dataset, which causes these classes to appear infrequently during training when selecting data with random sampling. By using VisLED to sample, more bicycle and motorcycle instances are drawn, leading to a performance gain at early data increments. This gain levels off as the training pool aggregates all bicycle and motorcycle samples.
Figure 5: From class performance, the trailer and construction vehicle classes are most challenging to learn. When VisLED querying is used, informative samples from these classes are pulled into the training pool, giving stronger detection performance than random sampling at nearly all data volumes.
...and 1 more figure
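The Closed-World Mining path described in the Figure 2 caption (zero-shot classification of scenes, followed by sampling from classes known to be minority) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the paper's procedure: image and class-text embeddings are assumed to be precomputed arrays in a shared space, the per-class clustering step is omitted, and the rarest-first round-robin budget allocation is an assumption of this sketch. The names `zero_shot_labels` and `mine_minority` are hypothetical.

```python
import numpy as np

def zero_shot_labels(image_emb, text_emb):
    """Assign each image the class whose text embedding is most cosine-similar."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)

def mine_minority(image_emb, text_emb, labeled_counts, budget):
    """Select up to `budget` unlabeled indices, favoring rare classes.

    Assumptions (not from the source): `labeled_counts[c]` is the number of
    labeled instances of class c, and the budget is spent round-robin over
    classes ordered from rarest to most common.
    """
    labels = zero_shot_labels(np.asarray(image_emb, dtype=float),
                              np.asarray(text_emb, dtype=float))
    # Classes sorted by how few labeled examples they currently have.
    order = [int(c) for c in np.argsort(labeled_counts, kind="stable")]
    # Per-class queue of candidate indices, in pool order.
    queues = {c: [int(i) for i in np.flatnonzero(labels == c)] for c in order}
    chosen = []
    while len(chosen) < budget and any(queues.values()):
        for c in order:
            if queues[c] and len(chosen) < budget:
                chosen.append(queues[c].pop(0))
    return chosen

if __name__ == "__main__":
    text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])   # class 0, class 1
    image_emb = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [1.0, 0.2]])
    print(mine_minority(image_emb, text_emb, labeled_counts=[100, 5], budget=3))
```

In a fuller version, each per-class queue would be ordered by a within-class clustering of the embeddings, as the caption describes, instead of by pool order.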
