Language-Enhanced Latent Representations for Out-of-Distribution Detection in Autonomous Driving
Zhenjiang Mao, Dong-You Jhong, Ao Wang, Ivan Ruchkin
TL;DR
The paper addresses out-of-distribution (OOD) detection in autonomous driving by introducing language-enhanced latent representations that support user-defined, language-guided anomaly detection. It uses CLIP to compute cosine similarities between image embeddings and text embeddings of natural-language prompts, and combines this language signal with conventional latent encodings. A Gamma-distribution-based thresholding scheme separates in-distribution from out-of-distribution inputs. Extensive simulations show that anomalous-language prompts can yield the strongest improvements, especially when language features are combined with standard latent representations at a length ratio of roughly 1:3. The results indicate that language-driven latent representations can increase transparency and user control in OOD detection for driving systems, and could support interactive, prompt-based anomaly detection in real-world perception pipelines.
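The Gamma-based thresholding step can be illustrated with a minimal sketch. The summary does not say which quantity is modeled or how the distribution is fitted, so everything below is an assumption: the input is a generic non-negative anomaly score collected on in-distribution data, the fit uses the method of moments, and the threshold is an upper quantile (0.95 here). All function names and numbers are hypothetical.

```python
import math
import statistics


def fit_gamma_moments(scores):
    """Method-of-moments Gamma fit (an assumed choice): returns (shape, scale)."""
    mean = statistics.fmean(scores)
    var = statistics.pvariance(scores)
    return mean * mean / var, var / mean


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized lower incomplete gamma.

    Adequate for moderate x / scale; very large arguments would need a
    continued-fraction expansion instead.
    """
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-16 * total and n < 10_000:
        term *= z / (shape + n)
        total += term
        n += 1
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def gamma_quantile(q, shape, scale):
    """Invert the CDF by bracketing and bisection."""
    lo, hi = 0.0, shape * scale  # start the upper bracket at the mean
    while gamma_cdf(hi, shape, scale) < q:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < q:
            lo = mid
        else:
            hi = mid
    return hi


def make_ood_detector(in_dist_scores, quantile=0.95):
    """Fit a Gamma to in-distribution scores; flag scores above the quantile."""
    shape, scale = fit_gamma_moments(in_dist_scores)
    threshold = gamma_quantile(quantile, shape, scale)
    return threshold, (lambda score: score > threshold)


# Made-up in-distribution anomaly scores, for illustration only.
calibration_scores = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3, 1.0, 0.95]
threshold, is_ood = make_ood_detector(calibration_scores)
print(f"threshold = {threshold:.3f}")
print(is_ood(1.0), is_ood(5.0))
```

The quantile level plays the role of a false-alarm budget on in-distribution data: with 0.95, roughly 5% of in-distribution scores would be flagged if the Gamma fit were exact.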
Abstract
Out-of-distribution (OOD) detection is essential in autonomous driving to determine when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings and therefore lack effective means of human interaction. With the rise of large foundation models, multimodal inputs make it possible to use human language as a latent representation, thus enabling language-defined OOD detection. In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation to improve the transparency and controllability of latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders, which produce only latent representations that are meaningless from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder and helps improve the detection performance when combined with standard representations.
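The language-based representation described above can be sketched as one cosine similarity per text prompt. This is a minimal illustration, not the authors' implementation: the vectors below are toy stand-ins for CLIP image and text embeddings, the prompts are hypothetical, and combining by plain concatenation is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_representation(image_embedding, prompt_embeddings):
    """One similarity per prompt: each entry is tied to a readable prompt."""
    return [cosine_similarity(image_embedding, p) for p in prompt_embeddings]


def combined_representation(language_features, standard_latent):
    """Concatenate language features with a conventional latent (assumed)."""
    return list(language_features) + list(standard_latent)


# Toy stand-ins for encoder outputs; real embeddings would come from CLIP.
image_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = {
    "a clear road": [1.0, 0.0, 0.1],  # hypothetical prompt
    "a foggy road": [0.0, 1.0, 0.3],  # hypothetical prompt
}
language_features = language_representation(
    image_embedding, prompt_embeddings.values()
)

# Made-up latent vector standing in for a pre-trained vision encoder's output.
standard_latent = [0.3, -1.2, 0.5, 0.8, -0.1, 0.4]
combined = combined_representation(language_features, standard_latent)

print(dict(zip(prompt_embeddings, language_features)))
print(len(combined))
```

Because each language feature corresponds to a named prompt, a user can read it directly (for example, "how foggy does this frame look?"), which is the sense in which the representation is more transparent than an opaque encoder latent.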
