Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving
Xuyang Chen, Conglang Zhang, Chuanheng Fu, Zihao Yang, Kaixuan Zhou, Yizhi Zhang, Jianan He, Yanfeng Zhang, Mingwei Sun, Zengmao Wang, Zhen Dong, Xiaoxiao Long, Liqiu Meng
TL;DR
Driving with DINO (DwD) tackles the sim-to-real video translation gap in autonomous driving by conditioning a diffusion model on DINOv3 Vision Foundation Model (VFM) features, which unify semantics and structure. The method introduces VFM-Prism with a Spatial Resolution Enhancer, Minor Components Pruning (PCA with Random Channel Tail Drop), and a Causal Temporal Aggregator to balance realism with control precision. It uses a Cosmos-Predict2.5 backbone with a learnable Spatial Alignment Module that adapts high-resolution DINO features to the diffusion backbone, and it explicitly preserves historical context when integrating frame-wise features. Empirical results on nuPlan/CARLA demonstrate state-of-the-art fidelity (sFID/sKID) and temporal stability (Motion/WarpSSIM) with strong Sim2Real consistency (mIoU), and show that the k and S hyperparameters enable controllable trade-offs.
Abstract
Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for "texture baking," while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/
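
As a rough illustration of the ideas named in the abstract, the following NumPy sketch shows one way a principal subspace projection, a random tail drop over the trailing principal channels, and a causal temporal convolution could be written. It is not the authors' implementation. The function names, the `num_components` and `min_keep` parameters, the choice to zero dropped channels rather than truncate them, and the fixed (non-learned) temporal kernel are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def fit_pca_basis(features, num_components):
    """Fit a PCA basis on features of shape (N, C).

    Returns the channel mean (C,) and the top principal directions (C, num_components).
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components].T

def project_with_tail_drop(features, mean, basis, min_keep, rng=None):
    """Project features onto the principal subspace.

    If rng is given (assumed training mode), zero a random-length tail of the
    projected channels, always keeping at least min_keep leading channels.
    Zeroing instead of truncating is an assumption of this sketch.
    """
    coeffs = (features - mean) @ basis  # shape (N, num_components)
    if rng is not None:
        total = basis.shape[1]
        keep = int(rng.integers(min_keep, total + 1))  # keep in [min_keep, total]
        coeffs = coeffs.copy()
        coeffs[:, keep:] = 0.0
    return coeffs

def causal_temporal_conv(frames, kernel):
    """Causal 1D convolution over time for frames of shape (T, C).

    out[t] = sum_j kernel[j] * frames[t - j], with zeros before t = 0, so each
    output depends only on the current and past frames. A real aggregator would
    learn the kernel; the fixed weights here are a placeholder.
    """
    num_taps = len(kernel)
    num_frames = frames.shape[0]
    pad = np.zeros((num_taps - 1, frames.shape[1]))
    padded = np.concatenate([pad, frames], axis=0)
    out = np.zeros_like(frames, dtype=float)
    for j, weight in enumerate(kernel):
        start = num_taps - 1 - j
        out += weight * padded[start:start + num_frames]
    return out
```

In this sketch, calling `project_with_tail_drop` without `rng` returns the full projection, which is how one might use it at inference time; passing a `numpy.random.Generator` enables the random tail drop. Left-padding the time axis in `causal_temporal_conv` is what keeps the aggregation causal: altering a later frame leaves all earlier outputs unchanged.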
