The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

Jiabao Wang, Hongyu Zhou, Yuanbo Yang, Jiahao Shao, Yiyi Liao

TL;DR

The Constant Eye paper introduces navdream, a benchmark that decouples appearance from geometry to quantify how appearance shifts affect planning in autonomous driving. By applying pixel-aligned style transfer to NAVSIM sequences, it produces a visual stress test that preserves scene geometry, so any change in planning performance can be attributed to appearance alone. The authors propose a universal perception interface built on a frozen DINOv3 backbone with a lightweight adapter, which enables zero-shot generalization across regression-based, diffusion-based, and scoring-based planners without target-domain training. Experiments on navdream and NAVSIM show that this frozen-vision interface keeps planning stable under severe appearance shifts and across multiple planning paradigms.

Abstract

Despite rapid progress, autonomous driving algorithms remain notoriously fragile under out-of-distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.
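
Because the benchmark keeps geometry fixed, the effect of appearance can be read as the gap between a planner's score on the original frames and its score on the restyled frames. The sketch below shows one way to summarize that gap. The function names, the style labels, and the numbers are hypothetical placeholders, not the paper's metric or results; it only assumes a driving score where higher is better.

```python
from typing import Dict


def relative_drop(clean_score: float, shifted_score: float) -> float:
    """Fraction of the clean-domain score lost under an appearance shift."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return (clean_score - shifted_score) / clean_score


def robustness_report(clean_score: float,
                      shifted_scores: Dict[str, float]) -> Dict[str, float]:
    """Relative drop for every appearance style, keyed by style name."""
    return {style: relative_drop(clean_score, score)
            for style, score in shifted_scores.items()}


# Placeholder values for illustration only; these are NOT results from the paper.
report = robustness_report(80.0, {"style_a": 60.0, "style_b": 80.0})
```

A drop near zero for every style would indicate a planner whose behavior does not depend on appearance; a large drop on geometry-preserving restyles points to the perception input rather than to road layout.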

Paper Structure

This paper contains 28 sections, 10 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Overview. We present navdream, a benchmark featuring out-of-distribution appearances that cause significant performance degradation across regression, diffusion, and scoring-based planning algorithms. Our method addresses this by viewing images through a "constant eye", leveraging a frozen visual foundation model to bridge the gap in appearance robustness.
  • Figure 2: Visual taxonomy of the appearance-based OOD shifts in navdream. We illustrate the original frame alongside 10 synthesized stylistic variations generated by the Flux model. All transformations preserve the underlying 3D geometry and semantic structures while shifting the visual domain to various OOD conditions.
  • Figure 3: Method. A frozen DINOv3 backbone $\Phi$ extracts features from raw camera inputs; these features retain consistent semantic information across visual domains. A lightweight feature adapter $\mathcal{A}$ then processes these structural representations to reduce their dimensionality. This plug-and-play solution can be integrated into regression, diffusion, and scoring-based planning paradigms to ensure robust trajectory generation across varying appearance conditions.
  • Figure 4: Qualitative comparison of the planning results. We visualize the planning results of the LTF baseline (left) and LTF-DINO (right) across diverse domains. While the baseline produces hazardous trajectories under visual domain shifts, LTF-DINO consistently produces robust, human-aligned paths despite severe environmental perturbations.
  • Figure 5: Feature visualization. For each appearance, we show the input image, VoVNet's PCA feature map, and DINOv3's PCA feature map. While DINOv3 extracts appearance-invariant semantic structures across style shifts, the VoVNet backbone exhibits large variations.
  • ...and 5 more figures
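
The interface described in the Figure 3 caption, a frozen backbone $\Phi$ followed by a lightweight adapter $\mathcal{A}$ whose output any planner can consume, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `frozen_backbone` is a simple standardization step standing in for DINOv3 (it is not DINOv3), `LinearAdapter` is a hypothetical fixed linear projection, and the two planner heads are placeholders. The point is only the shape of the pipeline and the fact that a geometry-preserving appearance change leaves the adapted features unchanged.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def frozen_backbone(image: Sequence[float]) -> Vector:
    """Toy stand-in for the frozen backbone Phi (this is NOT DINOv3).

    It has no trainable state and standardizes pixel values, so its output
    is unchanged by a global brightness or contrast shift of the input.
    """
    n = len(image)
    mean = sum(image) / n
    # Fall back to 1.0 for a constant image to avoid dividing by zero.
    std = (sum((p - mean) ** 2 for p in image) / n) ** 0.5 or 1.0
    return [(p - mean) / std for p in image]


class LinearAdapter:
    """Hypothetical adapter A: a linear map that reduces feature dimension."""

    def __init__(self, weights: List[Vector]) -> None:
        self.weights = weights  # one row per output dimension

    def __call__(self, features: Vector) -> Vector:
        return [sum(w * f for w, f in zip(row, features))
                for row in self.weights]


def perceive(image: Sequence[float], adapter: LinearAdapter) -> Vector:
    """Shared perception interface: A(Phi(image))."""
    return adapter(frozen_backbone(image))


# Placeholder planner heads; any head can consume the same adapted features.
PlannerHead = Callable[[Vector], float]
regression_head: PlannerHead = lambda feats: sum(feats)
scoring_head: PlannerHead = lambda feats: max(feats)

adapter = LinearAdapter([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]])
image = [0.1, 0.4, 0.9, 0.3]
restyled = [0.5 * p + 0.2 for p in image]  # same layout, different appearance

features_clean = perceive(image, adapter)
features_restyled = perceive(restyled, adapter)
```

In this toy setting `features_clean` and `features_restyled` agree up to floating-point rounding, so both placeholder heads return the same value for the original and the restyled input. A real foundation model gives no such exact guarantee; the paper's claim is empirical.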