DriveFix: Spatio-Temporally Coherent Driving Scene Restoration

Heyu Si; Brandon James Denis; Muyang Sun; Dragos Datcu; Yaoru Li; Xin Jin; Ruiju Fu; Yuliia Tatarinova; Federico Landi; Jie Song; Mingli Song; Qi Guo

Abstract

Recent advancements in 4D scene reconstruction, particularly those leveraging diffusion priors, have shown promise for novel view synthesis in autonomous driving. However, these methods often process frames independently or in a view-by-view manner, leading to a critical lack of spatio-temporal synergy. This results in spatial misalignment across cameras and temporal drift in sequences. We propose DriveFix, a novel multi-view restoration framework that ensures spatio-temporal coherence for driving scenes. Our approach employs an interleaved diffusion transformer architecture with specialized blocks to explicitly model both temporal dependencies and cross-camera spatial consistency. By conditioning the generation on historical context and integrating geometry-aware training losses, DriveFix enforces that the restored views adhere to a unified 3D geometry. This enables the consistent propagation of high-fidelity textures and significantly reduces artifacts. Extensive evaluations on the Waymo, nuScenes, and PandaSet datasets demonstrate that DriveFix achieves state-of-the-art performance in both reconstruction and novel view synthesis, marking a substantial step toward robust 4D world modeling for real-world deployment.

Paper Structure (17 sections, 2 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Methodology
Training Dataset Construction
Spatio-Temporal Consistency Modeling
Structurally-Grounded Conditioning
Inference and Scene Enhancement
Experiments
Experimental Setup
Quantitative Results
Qualitative Analysis
Ablation Study
Conclusion
More Implementation Details
Dataset Details
...and 2 more sections

Figures (7)

Figure 1: Comparison of novel view synthesis results between baseline reconstruction methods and our DriveFix. DriveFix produces significantly cleaner renderings with sharper textures and fewer artifacts, particularly in distant regions and overlapping camera views.
Figure 2: Overview of the DriveFix framework. Given corrupted multi-view renderings from a base 4D simulator, our approach first constructs hybrid historical context by combining degraded and ground-truth frames from previous time steps. This context, along with the current corrupted frames and structural guidance such as depth and semantics, is fed into an interleaved diffusion transformer architecture. The model alternates between temporal attention blocks, which propagate high-fidelity textures from history to suppress flickering, and spatial attention blocks with camera-geometry embeddings to enforce cross-view geometric consistency across all synchronized cameras. The output is a set of spatio-temporally coherent surround-view frames, which serve as pseudo-ground-truth to re-optimize the underlying 4D scene representation for photorealistic novel view synthesis.
Figure 3: Qualitative comparison of scene restoration and novel view synthesis. We evaluate DriveFix against several state-of-the-art baselines, including HUGSIM, SplatAD, and Difix3D+, with ground truth provided as a reference. Red bounding boxes highlight challenging regions where baseline methods frequently exhibit severe artifacts, such as blurring, ghosting, and geometric distortions, particularly in distant objects and overlapping camera views. In contrast, DriveFix produces significantly cleaner renderings with sharper textures and maintains spatio-temporal continuity.
Figure 4: Qualitative ablation study of DriveFix on the Waymo Open Dataset. Columns (a) through (e) each show the restoration result of a variant of our framework with one component removed, and column (f) shows the full model. From left to right: (a) without cross-view spatial attention, (b) without temporal attention, (c) without historical context conditioning, (d) without geometry-aware alignment loss, (e) without depth and semantic conditioning, and (f) the full DriveFix model. Removing any component leads to visible degradation, such as ghosting, blurring, or temporal inconsistency, highlighting the necessity of each module for spatio-temporally coherent restoration.
Figure 5: Ablation study on the number of fine-tuning steps for the alignment loss. All metrics peak at around 3,000 iterations, after which further fine-tuning yields only marginal gains or slight degradation. This suggests that 3,000 iterations offer the best trade-off between geometric fidelity and generative diversity.
...and 2 more figures
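
The interleaving of temporal and spatial attention blocks described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one feature vector per (time step, camera) frame, uses single-head attention with identity projections instead of learned weights, and omits the camera-geometry embeddings, historical context, and depth/semantic conditioning. The names `attend`, `temporal_block`, `spatial_block`, and `interleaved` are hypothetical and chosen only for illustration.

```python
import math


def attend(tokens):
    """Single-head self-attention with identity Q/K/V projections (an assumption).

    tokens: list of equal-length feature vectors; returns one mixed vector per token.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def temporal_block(x):
    """Attend across time steps separately for each camera. x[t][v] is a vector."""
    T, V = len(x), len(x[0])
    y = [[None] * V for _ in range(T)]
    for v in range(V):
        mixed = attend([x[t][v] for t in range(T)])
        for t in range(T):
            # Residual connection around the attention output.
            y[t][v] = [a + b for a, b in zip(x[t][v], mixed[t])]
    return y


def spatial_block(x):
    """Attend across synchronized cameras separately for each time step."""
    y = []
    for frame_set in x:
        mixed = attend(frame_set)
        y.append([[a + b for a, b in zip(tok, mix)]
                  for tok, mix in zip(frame_set, mixed)])
    return y


def interleaved(x, depth=2):
    """Alternate temporal and spatial blocks `depth` times."""
    for _ in range(depth):
        x = spatial_block(temporal_block(x))
    return x


if __name__ == "__main__":
    # Two time steps, two cameras, two-dimensional features (toy values).
    x = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.5, 0.5], [1.0, 1.0]]]
    out = interleaved(x, depth=2)
    print(len(out), len(out[0]), len(out[0][0]))
```

In this sketch a temporal block never mixes information between cameras, and a spatial block never mixes information between time steps. Only the alternation of the two lets every frame influence every other frame, which is the intuition behind the interleaved design.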
