Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

Xuyang Chen, Conglang Zhang, Chuanheng Fu, Zihao Yang, Kaixuan Zhou, Yizhi Zhang, Jianan He, Yanfeng Zhang, Mingwei Sun, Zengmao Wang, Zhen Dong, Xiaoxiao Long, Liqiu Meng

TL;DR

Driving with DINO (DwD) tackles the sim-to-real video translation gap in autonomous driving by conditioning a diffusion model on DINOv3 Vision Foundation Model features, which unify semantics and structure. The method introduces VFM-Prism, which combines a Spatial Resolution Enhancer, Minor Components Pruning (PCA with Random Channel Tail Drop), and a Causal Temporal Aggregator to balance realism with control precision. It builds on a Cosmos-Predict2.5 backbone, uses a learnable Spatial Alignment Module to adapt high-resolution DINO features to that backbone, and explicitly preserves historical context when integrating frame-wise features. Empirical results on nuPlan/CARLA demonstrate state-of-the-art fidelity (sFID/sKID) and temporal stability (Motion/WarpSSIM) with strong Sim2Real consistency (mIoU), and show that the k and S hyperparameters enable controllable trade-offs.

Abstract

Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/
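
To make the Principal Subspace Projection and Random Channel Tail Drop steps more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the projection is an ordinary PCA onto the top-k principal directions of per-patch features, and that the tail drop zeroes a randomly chosen suffix of the projected channels during training. The function names (`fit_pca_basis`, `project`, `random_channel_tail_drop`) and the `k_min` parameter are hypothetical and do not come from the paper.

```python
import numpy as np


def fit_pca_basis(features, k):
    """Fit a top-k PCA basis on features of shape (N, C).

    Returns the feature mean (C,) and a basis (C, k) whose columns are the
    principal directions, ordered by decreasing explained variance.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T


def project(features, mean, basis):
    """Project (N, C) features onto the k-dimensional principal subspace."""
    return (features - mean) @ basis


def random_channel_tail_drop(coeffs, k_min, rng):
    """Zero a random tail of the PCA channels (assumed training-time step).

    A cutoff is drawn uniformly from [k_min, k]; channels at or beyond the
    cutoff are set to zero. Returns the masked coefficients and the cutoff.
    """
    k = coeffs.shape[-1]
    keep = int(rng.integers(k_min, k + 1))  # upper bound is exclusive
    out = coeffs.copy()
    out[..., keep:] = 0.0
    return out, keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 32))  # stand-in for per-patch VFM features
    mean, basis = fit_pca_basis(feats, k=8)
    coeffs = project(feats, mean, basis)
    dropped, keep = random_channel_tail_drop(coeffs, k_min=4, rng=rng)
    print(coeffs.shape, keep)
```

Under these assumptions, keeping only the leading components discards the low-variance directions, while the random cutoff exposes the downstream model to conditioning signals of varying dimensionality instead of a single fixed truncation.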

Paper Structure (29 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 2: Consistency-Realism Dilemma
  • Figure 3: The framework of DwD. (a) Training: The model is trained on real-world driving videos using a controllable diffusion architecture. The core module, VFM-Prism, processes DINOv3 features through Spatial Resolution Enhancement, Minor Components Pruning, and Causal Temporal Aggregation. These refined features are injected via a Control Branch to guide the reconstruction of the original video. (b) Inference: The model performs Sim-to-Real translation using synthetic inputs. By leveraging the domain-invariant structural features extracted by VFM-Prism (specifically via PCA-based pruning to mitigate texture leakage), DwD generates high-fidelity photorealistic videos that strictly preserve the simulation's geometric layout.
  • Figure 4: Lower-dimensional PCA components encode coarse semantic layouts, whereas increasing the dimensions leads to the refinement of high-frequency details. Similarity maps are shown for two anchor points (top: $\times$ on building, bottom: $\times$ on one car).
  • Figure 5: Qualitative comparison with state-of-the-art methods. The red boxes highlight inconsistencies with respect to the CG input, while the green boxes point out low-fidelity textures.
  • Figure 6: The Spatial Alignment Module and Causal Temporal Aggregator.
  • ...and 7 more figures
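
The Causal Temporal Aggregation named in the Figure 3 and Figure 6 captions relies on causal convolutions, meaning the output at frame t depends only on frames up to t. The sketch below illustrates that property with a fixed 1D kernel over per-frame feature vectors. It is an illustration under stated assumptions rather than the paper's module: the real aggregator is presumably learned and operates on spatial feature maps, and the name `causal_temporal_conv` and the kernel layout are hypothetical.

```python
import numpy as np


def causal_temporal_conv(x, kernel):
    """Causal convolution along the time axis.

    x:      (T, C) array of per-frame features.
    kernel: (K,) weights; kernel[-1] multiplies the current frame and
            kernel[0] the frame K-1 steps in the past.
    Frames before the start of the clip are treated as zeros (left padding),
    so out[t] never depends on any frame later than t.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    num_taps = kernel.shape[0]
    num_frames = x.shape[0]
    # Pad only on the left (the past) to keep the operation causal.
    pad = np.zeros((num_taps - 1,) + x.shape[1:], dtype=float)
    padded = np.concatenate([pad, x], axis=0)
    out = np.zeros_like(x)
    for j in range(num_taps):
        out += kernel[j] * padded[j:j + num_frames]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(6, 4))  # 6 frames, 4 feature channels
    smoothed = causal_temporal_conv(frames, [0.2, 0.3, 0.5])
    print(smoothed.shape)
```

Padding only on the past side is the design choice that distinguishes this from an ordinary symmetric temporal filter: perturbing a later frame leaves every earlier output unchanged.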