VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction

Muhua Zhu, Xinhao Jin, Yu Zhang, Yifei Xue, Tie Ji, Yizhen Lao

TL;DR

VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion, achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.

Abstract

Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motions. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by synergizing structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.
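The abstract does not say how the camera path is smoothed once camera parameters have been estimated. As a minimal sketch of what mitigating camera shake can look like at the trajectory level, the following applies a Gaussian low-pass filter to per-frame camera positions. The function names, the `radius` and `sigma` values, and the choice of a Gaussian filter are all assumptions made for illustration, not the authors' method. Rotations are left out, since smoothing them properly requires working on the rotation group.

```python
import numpy as np


def gaussian_kernel(radius: int, sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian weights over offsets -radius..radius."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()


def smooth_trajectory(positions: np.ndarray, radius: int = 15,
                      sigma: float = 5.0) -> np.ndarray:
    """Low-pass filter a (T, 3) array of per-frame camera positions.

    Hypothetical helper: radius and sigma are illustrative defaults.
    """
    kernel = gaussian_kernel(radius, sigma)
    # Repeat the first and last positions so the ends of the path
    # are not pulled toward zero by the filter.
    padded = np.pad(positions, ((radius, radius), (0, 0)), mode="edge")
    # Filter each coordinate independently; "valid" restores length T.
    columns = [
        np.convolve(padded[:, k], kernel, mode="valid")
        for k in range(positions.shape[1])
    ]
    return np.stack(columns, axis=1)


# Synthetic example: a straight camera path with added jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 120)
intended = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
shaky = intended + 0.02 * rng.standard_normal(intended.shape)
stable = smooth_trajectory(shaky)
```

Rendering frames from `stable` instead of `shaky` is what exposes regions the original camera never saw, which is why a full-frame method needs a completion stage after the smoothing step.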

Paper Structure

This paper contains 15 sections, 9 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: We study the task of video stabilization from a novel 3D perspective. Given an uncalibrated video, we first reconstruct the local 3D scene using a feed-forward model. We then employ a hybrid rendering strategy, fusing semantic and geometric cues to synthesize stabilized frames. Finally, a dual-stream video diffusion model performs full-frame completion and refinement, producing high-fidelity, temporally coherent video.
  • Figure 2: Comparison of video stabilization paradigms in a rapid pure-rotation scene, characterized by extreme motion, blur, and severe cropping. (a) 2D methods: Significant information loss due to aggressive cropping. (b) Learning-based: Geometric distortions and temporal flickering. (c) Completion-based: Failure to maintain structural integrity. (d) SfM-based: Pose estimation failure under geometric degeneracy. (e) Ours: Full-frame, high-fidelity stabilization with robust geometric and temporal consistency.
  • Figure 3: Overview of the VS3R pipeline. VS3R follows a "reconstruct-smooth-refine" paradigm: (1) Deep 3D Reconstruction: Estimates camera parameters $g_t$, semantic masks $M_t$, and depth $D_t$ from uncalibrated video via a feed-forward model [hu2025vggt4d]. (2) Hybrid Stabilized Rendering: Refines the dynamic mask $CM_t$ by merging $M_t$ with the geometric mask $FM_t$, then renders stabilized frames $S_t$ along the smoothed trajectory. (3) Full-frame Refinement: A dual-stream diffusion model restores disoccluded regions and rectifies artifacts to produce high-fidelity, temporally coherent frames $\hat{S}_t$.
  • Figure 4: Architecture of the dual-stream video diffusion model. To facilitate efficient fine-tuning, we freeze all network parameters except for the LoRA weights. Specifically, LoRA layers with a rank of 32 are integrated into every transformer block within the two DiT models.
  • Figure 5: Qualitative comparison of stabilized videos on the NUS dataset [liu2013bundled]. Our VS3R generates stabilized videos with significantly higher content, geometric, and temporal consistency than existing state-of-the-art methods, including DIFRINT [DIFRINT], Rstab [peng20243d], and GaVS [you2025gavs], across various challenging scenarios.
  • ...and 3 more figures
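
The Figure 4 caption states that only LoRA weights of rank 32 are trained while all other parameters stay frozen. As a rough illustration of what that means for a single linear layer, the sketch below adds a low-rank update to a frozen weight matrix using NumPy. The class name, the scaling factor `alpha`, the initialization scale, and the layer sizes are assumptions chosen for the example. The caption gives only the rank, and the actual DiT blocks are not reproduced here.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update.

    Output is x @ W.T + (alpha / rank) * x @ A.T @ B.T, where only
    A and B would be trained. Hypothetical illustration, not the
    paper's code.
    """

    def __init__(self, weight: np.ndarray, rank: int = 32,
                 alpha: float = 32.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen
        self.A = 0.01 * rng.standard_normal((rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable
        self.scale = alpha / rank                          # assumed scaling

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (batch, d_in).
        frozen_out = x @ self.weight.T
        lora_out = (x @ self.A.T) @ self.B.T
        return frozen_out + self.scale * lora_out

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


# Illustrative layer sizes; the real hidden width is not given in the source.
W = np.random.default_rng(1).standard_normal((64, 48))
layer = LoRALinear(W, rank=32)
x = np.ones((2, 48))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and fine-tuning only has to learn the low-rank correction. For a square layer of width d, the trainable count is 2 · 32 · d instead of d², which is where the efficiency mentioned in the caption comes from.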