A Semantic Observer Layer for Autonomous Vehicles: Pre-Deployment Feasibility Study of VLMs for Low-Latency Anomaly Detection

Kunal Runwal, Swaraj Gajare, Daniel Adejumo, Omkar Ankalkope, Siddhant Baroth, Aliasghar Arab

Abstract

Semantic anomalies (context-dependent hazards that pixel-level detectors cannot reason about) pose a critical safety risk in autonomous driving. We propose a semantic observer layer: a quantized vision-language model (VLM) running at 1--2 Hz alongside the primary AV control loop, monitoring for semantic edge cases and triggering fail-safe handoffs when they are detected. Using Nvidia Cosmos-Reason1-7B with NVFP4 quantization and FlashAttention2, we achieve ~500 ms inference, a ~50x speedup over the unoptimized FP16 baseline (no quantization, standard PyTorch attention) on the same hardware, satisfying the observer timing budget. We benchmark accuracy, latency, and quantization behavior in static and video conditions, identify NF4 recall collapse (10.6%) as a hard deployment constraint, and present a hazard analysis mapping performance metrics to safety goals. The results establish a pre-deployment feasibility case for the semantic observer architecture on embodied-AI AV platforms.
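The abstract describes an observer that ticks at 1--2 Hz, classifies each frame window as Normal or Anomaly, and triggers a fail-safe handoff on a high-confidence anomaly. A minimal sketch of one such observer tick is below; `classify`, `trigger_failsafe`, and the 0.8 confidence threshold are hypothetical stand-ins (the paper specifies the 1--2 Hz cadence and ~500 ms inference, not these names or values).

```python
import time

OBSERVER_HZ = 2               # observer cadence from the paper (1--2 Hz)
PERIOD_S = 1.0 / OBSERVER_HZ
CONF_THRESHOLD = 0.8          # illustrative threshold, not from the paper


def observer_step(frame, classify, trigger_failsafe):
    """One tick of the semantic observer loop.

    `classify(frame)` stands in for quantized-VLM inference (~500 ms per
    call in the paper's setup) and must return (label, confidence) with
    label in {"Normal", "Anomaly"}. Returns True if the fail-safe fired.
    """
    t0 = time.monotonic()
    label, conf = classify(frame)

    fired = False
    if label == "Anomaly" and conf >= CONF_THRESHOLD:
        trigger_failsafe()    # hand control to the fail-safe path
        fired = True

    # Sleep out the remainder of the period so the loop holds its cadence;
    # the ~500 ms inference fits inside the 500 ms budget at 2 Hz.
    elapsed = time.monotonic() - t0
    if elapsed < PERIOD_S:
        time.sleep(PERIOD_S - elapsed)
    return fired
```

The key design point is that the observer runs beside, not inside, the primary control loop: a missed tick degrades monitoring coverage but never blocks the controller.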

Paper Structure

This paper contains 27 sections, 14 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 3: Qualitative results on the Hazard Perception Test Dataset [theorypass_hazard_perception] using Cosmos-Reason1-7B [nvidia2025cosmosreason1physicalcommonsense]. Top row (Samples 11--12): correctly classified normal frames. Bottom row (Samples 13--14): a normal frame and a detected anomaly, demonstrating context-aware semantic reasoning.
  • Figure 4: Semantic observer layer architecture. The VLM observer runs at 1--2 Hz alongside the primary AV control loop, processing temporal windows of RGB frames with a structured prompt. Upon detecting a high-confidence semantic constraint violation, it triggers a fail-safe handoff. Visual tokens from Cosmos-Reason1-7B are projected into the language embedding space and evaluated against context-conditioned semantic constraints to produce a binary {Normal, Anomaly} decision.
  • Figure 5: High-level architecture of Cosmos-Reason1-7B for anomaly detection. Visual features from the vision encoder are projected into the language embedding space and jointly processed with prompt tokens by a decoder-only transformer backbone (see Fig. 4 for block details).
  • Figure 6: Architecture used in Cosmos-Reason1-7B. Visual tokens extracted by the Qwen2.5-VL vision encoder are projected into the language embedding space via a two-layer MLP merger and concatenated with prompt tokens. NVFP4 quantization is applied to the backbone weight matrices, and FlashAttention2 accelerates attention computation. Cosmos-Reason1-7B retains the Qwen2.5-VL architecture and is further fine-tuned on robotics and embodied reasoning data for physical AI tasks.
  • Figure 7: Dataset orchestration for FCDD training. RDD2022 [arya2024rdd2022] images (all damage types merged) serve as the anomalous class, while Cityscapes [cordts2016cityscapes] images filtered for ≥25% road coverage provide the normal class. An 80/20 train-test split yields 31,386 training samples (2,598 normal, 28,788 anomalous) and 7,643 test samples (447 normal, 7,196 anomalous).
  • ...and 2 more figures
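Figure 7's dataset orchestration (RDD2022 damage images as the anomalous class, road-heavy Cityscapes images as the normal class, then a per-class 80/20 split) can be sketched as follows. The function name, the `road_coverage` mapping, and the seeded shuffle are assumptions for illustration; only the class assignments, the 25% road-coverage filter, and the 80/20 split come from the caption.

```python
import random


def build_fcdd_split(rdd_images, cityscapes_images, road_coverage,
                     train_frac=0.8, seed=0):
    """Assemble an FCDD-style anomaly split in the spirit of Fig. 7.

    - RDD2022 images (all damage types merged) -> anomalous class
    - Cityscapes images with >= 25% road coverage -> normal class
    - per-class 80/20 train-test split
    Returns {"train": [(image, label), ...], "test": [...]}.
    """
    normal = [img for img in cityscapes_images if road_coverage[img] >= 0.25]
    anomalous = list(rdd_images)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    split = {"train": [], "test": []}
    for label, items in (("normal", normal), ("anomalous", anomalous)):
        items = items[:]
        rng.shuffle(items)
        cut = int(train_frac * len(items))
        split["train"] += [(img, label) for img in items[:cut]]
        split["test"] += [(img, label) for img in items[cut:]]
    return split
```

Splitting each class independently preserves the heavy anomalous-to-normal imbalance (roughly 12:1 in the paper's numbers) in both the train and test partitions.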