Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving

Zhenjiang Mao, Dong-You Jhong, Ao Wang, Ivan Ruchkin

TL;DR

The paper addresses out-of-distribution (OOD) detection in autonomous driving by introducing language-enhanced latent representations that let users specify, in natural language, which anomalies to detect. It uses CLIP to compute cosine similarities between image embeddings and the text embeddings of natural-language prompts, and it combines this language signal with conventional latent encodings. A Gamma-distribution-based thresholding scheme separates in-distribution from out-of-distribution inputs. In the reported simulations, prompts that describe anomalies yield the strongest improvements, especially when language features are combined with standard latent representations at a length ratio of roughly 1:3. The work shows that language-driven latent representations can increase transparency and user control in OOD detection for driving systems, and it points toward interactive, prompt-based anomaly detection in real-world perception pipelines.

Abstract

Out-of-distribution (OOD) detection is essential in autonomous driving for determining when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings, thus lacking effective human interaction capabilities. With the rise of large foundation models, multimodal inputs offer the possibility of taking human language as a latent representation, thus enabling language-defined OOD detection. In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation to improve the transparency and controllability of latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders that can only produce latent representations that are meaningless from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder and helps improve the detection performance when combined with standard representations.
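
The language-based representation described above can be illustrated with a minimal sketch. It assumes the image and prompt embeddings have already been produced by a CLIP-style encoder; the three-dimensional vectors, the prompt wording, and the function names `cosine_similarity`, `language_features`, and `combine` are all hypothetical stand-ins, not the paper's code. The toy sizes (2 language features, 6 standard latent dimensions) are chosen only to show a 1:3 split; which side of the ratio is which is an assumption here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def language_features(image_emb, prompt_embs):
    """One cosine similarity per prompt, so each entry of the
    representation reads as 'how well does the image match this sentence'."""
    return [cosine_similarity(image_emb, p) for p in prompt_embs]

def combine(lang_feats, latent):
    """Concatenate language features with a standard latent vector."""
    return list(lang_feats) + list(latent)

# Toy stand-ins for encoder outputs (assumption: real embeddings
# would come from a multimodal model such as CLIP).
image_emb = [0.9, 0.1, 0.4]
prompt_embs = {
    "a clear road in daylight": [1.0, 0.0, 0.5],     # hypothetical normal prompt
    "a road covered in heavy fog": [0.0, 1.0, 0.2],  # hypothetical anomalous prompt
}

lang = language_features(image_emb, list(prompt_embs.values()))
latent = [0.2, -0.1, 0.7, 0.0, 0.3, -0.5]  # stand-in for a vision encoder's latent
combined = combine(lang, latent)

for prompt, score in zip(prompt_embs, lang):
    print(f"{prompt}: {score:.3f}")
```

Because every language feature is tied to a sentence the user wrote, changing the prompts changes what the representation is sensitive to, which is the controllability the abstract refers to.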

Paper Structure (11 sections, 3 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Examples of language-based encoding with normal (green text) and anomalous (red text) descriptions, obtained by calculating the cosine similarity (CS) between the text and image representations.
  • Figure 2: Out-of-distribution types (left) and the architecture of OOD detection with latent representation distance (right).
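
The distance-based detection in Figure 2 and the Gamma-distribution thresholding mentioned in the TL;DR can be sketched as follows. This is one plausible reading, not the paper's implementation: it assumes the OOD score is the Euclidean distance to the nearest in-distribution representation, that a Gamma distribution is fitted to in-distribution scores by the method of moments, and that the threshold is a high quantile of that fit. The function names, the 0.95 quantile, and the calibration scores are all hypothetical.

```python
import math

def ood_score(rep, reference_reps):
    """Distance from a representation to its nearest in-distribution
    neighbour (assumed scoring rule)."""
    return min(math.dist(rep, r) for r in reference_reps)

def fit_gamma_moments(scores):
    """Method-of-moments Gamma fit: shape = mean^2 / var, scale = var / mean."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean * mean / var, var / mean

def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized lower
    incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, 10000):
        term *= z / (shape + n)
        total += term
        if term < 1e-15 * total:
            break
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))

def gamma_quantile(p, shape, scale):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, shape * scale
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical in-distribution calibration scores.
calibration = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 0.85]
shape, scale = fit_gamma_moments(calibration)
threshold = gamma_quantile(0.95, shape, scale)

def is_ood(score):
    """Flag a score as OOD if it exceeds the fitted threshold."""
    return score > threshold

print(f"shape={shape:.2f} scale={scale:.4f} threshold={threshold:.3f}")
```

The quantile level trades false alarms against missed detections; a library routine such as a Gamma fit and inverse CDF from a statistics package could replace the hand-written helpers.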