Towards Explainable, Safe Autonomous Driving with Language Embeddings for Novelty Identification and Active Learning: Framework and Experimental Analysis with Real-World Data Sets

Ross Greer, Mohan Trivedi

TL;DR

The paper tackles novelty in autonomous driving, where open-set, high-level scene reasoning is needed beyond traditional safety metrics. It proposes using language embeddings via CLIP to identify novel driving scenes and to support safety takeovers and active learning across real-world datasets. Novelty is detected by clustering CLIP-based image embeddings and labeling unclustered images as novel, with textual explanations of novelty generated by a language-vision–LLM pipeline, demonstrated on LAVA and TUMTraf. The results show effective isolation of novel scenes and plausible explanations, suggesting practical impact for safe takeovers, data curation, and multi-task active learning in real-world autonomous driving deployments.

Abstract

This research explores the integration of language embeddings for active learning in autonomous driving datasets, with a focus on novelty detection. Novelty arises from unexpected scenarios that autonomous vehicles struggle to navigate, necessitating higher-level reasoning abilities. Our proposed method employs language-based representations to identify novel scenes, emphasizing the dual purpose of safety takeover responses and active learning. The research presents a clustering experiment using Contrastive Language-Image Pretrained (CLIP) embeddings to organize datasets and detect novelties. We find that the proposed algorithm effectively isolates novel scenes from a collection of subsets derived from two real-world driving datasets, one vehicle-mounted and one infrastructure-mounted. From the generated clusters, we further present methods for generating textual explanations of elements which differentiate scenes classified as novel from other scenes in the data pool, presenting qualitative examples from the clustered results. Our results demonstrate the effectiveness of language-driven embeddings in identifying novel elements and generating explanations of data, and we further discuss potential applications in safe takeovers, data curation, and multi-task active learning.
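
The clustering idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the clustering algorithm or its parameters, so the sketch uses the DBSCAN definition of noise (a point that is neither a core point nor near one), the hypothetical names `find_novel`, `eps`, and `min_samples`, and small synthetic vectors standing in for CLIP image embeddings.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_novel(embeddings, eps=0.1, min_samples=4):
    """Return indices of embeddings that belong to no dense cluster.

    A point is a core point if at least `min_samples` points (itself
    included) lie within cosine distance `eps`. A point is treated as novel
    if it is neither a core point nor within `eps` of one, which is the
    DBSCAN definition of noise (an assumed stand-in for the paper's method).
    """
    n = len(embeddings)
    neighbors = [
        [j for j in range(n)
         if cosine_distance(embeddings[i], embeddings[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_samples}
    return [
        i for i in range(n)
        if i not in core and not core.intersection(neighbors[i])
    ]


def noisy_copy(center, scale, rng):
    """Return `center` perturbed by small Gaussian noise per coordinate."""
    return [c + rng.gauss(0.0, scale) for c in center]


# Synthetic stand-ins for image embeddings: two tight groups of scenes
# plus one scene that resembles neither group.
rng = random.Random(0)
group_a = [1.0, 0.0, 0.0, 0.0]
group_b = [0.0, 1.0, 0.0, 0.0]
pool = [noisy_copy(group_a, 0.02, rng) for _ in range(10)]
pool += [noisy_copy(group_b, 0.02, rng) for _ in range(10)]
pool.append([0.0, 0.0, 1.0, 0.0])

novel_indices = find_novel(pool)
print(novel_indices)  # [20]
```

In practice the vectors would come from a CLIP image encoder rather than from synthetic data, and the distance threshold would need tuning per dataset; the point of the sketch is only that an image left outside every dense group is the one flagged as novel.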

Paper Structure (19 sections, 5 equations, 13 figures, 3 tables, 2 algorithms)

Figures (13)

  • Figure 1: Natural language serves as a form of feature extraction, whereby data can be represented by meaningful description immediately understandable to a human reader. Such representations can also be generated by machines using vision-language models, and we present algorithms by which such representations (in both final and intermediate forms) can serve tasks of novelty identification in autonomous driving, useful towards anomaly detection and active learning tasks.
  • Figure 2: There are many important tasks to solve for the autonomous vehicle in this scene: detection of obstacles and external agents, prediction of agent trajectories for safe planning, and interpretation of traffic control elements for control decisions. For a limited data budget, at what point does it become more beneficial for a learning model to bring in new scenes instead of variants of old scenes? Does the information gain of data in new scenes exceed the information gain of variants of old scenes across all tasks?
  • Figure 3: Cohn et al. [cohn1994improving] use an abstract setting like the figure shown on the left to suggest that many possible models (black rectangles) could classify the points, but that such model performance does not necessarily indicate complete and accurate learning of the appropriate concept. By sampling in the regions where the model may be uncertain, the model boundary can be refined more tightly, leading to improved generalizability. On the right, we abstractly show how this manner of thinking might be applied to active learning for autonomous driving. In the center are scenes which contain pedestrians; outside are scenes without. A region shaded in yellow indicates a hypothetical region where the model could benefit from sampling, to narrow its hypothesis of what separates pedestrian scenes from others. However, the general problem of safe autonomy is much more complex: multiple tasks (such as object detection, tracking, and localization) must all be met with high performance, and a point sampled as uncertain for one task may be redundant for another. Further, the high-dimensional nature of the data does not reduce to such an easily separable space. In this research, we propose that language-based embeddings of scene images are a useful reduction for identifying novel qualities, on the premise that sampling novelty may be useful towards multi-task model improvement.
  • Figure 4: If we view deep learning (and machine learning in general) as a process by which parameters algorithmically extract useful features from data (by converting data from its original structure to a structure of abstract, lower-dimensional, intelligent meaning), then we can consider each data point to be projected into a variety of spaces of varying dimension throughout the forward process. For a model to be successfully fit to its task (i.e., neither overfit nor underfit), the data must at some point reach a meaningful, useful projected representation. An example projection is depicted in the two graphs on the right. Presumably, each point carries with it some "coverage" of the latent space, shown with a black radius, such that similar points not found in the training set would receive a similar prediction from the model. When we add new data to train a model, such as the candidates shown in yellow and red in the middle graph, we would like to be efficient, adding only data which improves the model's coverage of the problem latent space. The driving question of this research is: what descriptors or features make a useful representation, such that an algorithm can quickly identify points which are less useful (such as the point shown in red)? Do these descriptors come from high-level abstract meaning, as we show on the left with human-understandable features like number of pedestrians, speed, and weather? Or should these descriptors emerge from an embedded, learned feature derived from the raw sensor input and the model's own transformations of this input, trading explainability for optimality? How can these descriptors be leveraged towards active learning, and what implications do these choices have for curating and annotating such datasets?
  • Figure 5: An overview of the method presented in this paper. Scene images from a pool of driving scenarios are input to a Contrastive Language-Image Pretrained (CLIP) image encoder. The resulting embedding could be passed to a text decoder for image captioning; instead, we cluster the embedding vectors from a large pool of samples, as shown on the right. Images whose representation appears independent of the identified clusters, such as the one in white at the center of the representation space, are considered to be novel. The experiments shared in this research examine whether the novelty identified by this method aligns with the concepts of novelty reflected in the organization of the datasets.
  • ...and 8 more figures
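
The "coverage" intuition in the Figure 4 caption can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' procedure: the caption gives no formula, so the fixed `radius`, the Euclidean distance, the greedy visiting order, and the function name `select_uncovered` are all hypothetical choices.

```python
import math


def euclidean(a, b):
    """Return the Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_uncovered(train, candidates, radius):
    """Greedily pick candidates that lie outside all existing coverage.

    Each point already in the pool is assumed to cover a ball of the given
    radius. A candidate is kept only if it is farther than `radius` from
    every training point and from every candidate kept so far.
    """
    covering = list(train)
    selected = []
    for index, candidate in enumerate(candidates):
        if all(euclidean(candidate, p) > radius for p in covering):
            selected.append(index)
            covering.append(candidate)
    return selected


# Toy 2-D latent space: three training points and three candidates.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
candidates = [(0.1, 0.1), (2.0, 2.0), (2.1, 2.0)]

picked = select_uncovered(train, candidates, radius=0.5)
print(picked)  # [1]
```

The first candidate falls inside the coverage of an existing training point and the third falls inside the coverage of the second, so only the second adds new coverage. This mirrors the caption's distinction between a useful candidate and a redundant one.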