No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

Tao Liu, Gang Wan, Kan Ren, Shibo Wen

TL;DR

Experiments show that the proposed unsupervised framework for online video stabilization consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.

Abstract

We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.

Paper Structure (80 sections, 41 equations, 20 figures, 7 tables)

Figures (20)

  • Figure 1: Limitations of existing methods and an overview of our approach. (a) Traditional pipelines rely on sparse, spatially biased keypoints, leading to inaccurate motion estimation and blur/distortion in stabilized frames. (b) Over-smoothing and fixed filtering ignore scene motion trends, resulting in residual jitter, black borders, and warping in the output video. (c) Our approach uses dense, evenly distributed keypoints together with accurate optical flow to propagate motion estimates across frames; it operates online without future frames, enabling real-time processing. (d) In the bottom-right chart, our method (red star) outperforms both traditional offline methods and recent online methods in stabilization quality, achieving state-of-the-art performance for online unsupervised stabilization.
  • Figure 2: Overview of our video stabilization framework, consisting of three key modules: motion estimation, motion propagation, and motion compensation. The motion estimation module detects keypoints and calculates optical flow to estimate motion between frames. The motion propagation module transfers the estimated motion information to a global trajectory buffer, ensuring motion consistency across frames. The motion compensation module adjusts the frames based on the updated trajectories to produce a stabilized output. A multi-threaded processing structure utilizes three threads, each dedicated to one of the core modules, and leverages shared buffer queues for efficient parallel execution. This approach improves processing speed and online performance, enabling the stabilization of video frames with minimal latency.
  • Figure 3: Comparison of visual results across different online methods. Distorted regions and black borders are highlighted with red boxes.
  • Figure 4: User study preferences.
  • Figure 5: Normalized ablation study results on five datasets (NUS, DeepStab, Selfie, GyRo, UAV-Test). The x-axis denotes ablation configurations A1–A9 (variants with modules removed or modified) and A10 (our full model with window length $L=7$). The y-axis shows normalized scores (higher is better) for five metrics: Consistency (C), Distortion Value (D), Peak Signal-to-Noise Ratio (PSNR), Stability Score (S), and Structural Similarity (SSIM).
  • ...and 15 more figures
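
The three-thread structure described in the Figure 2 caption (one thread per module, connected by shared buffer queues) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_stage` and `run_pipeline`, the `maxsize` parameter, and the sentinel-based shutdown are assumptions made for illustration, and the three stage functions are placeholders standing in for motion estimation, motion propagation, and motion compensation.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of the frame stream (assumption)


def run_stage(fn, inbox, outbox):
    """Apply fn to every item taken from inbox and put the result on outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # forward the shutdown signal to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(frames, estimate, propagate, compensate, maxsize=8):
    """Run three placeholder stages in separate threads; return outputs in order."""
    # Bounded queues between stages apply back-pressure to a fast producer.
    q_in = queue.Queue(maxsize)
    q_est = queue.Queue(maxsize)
    q_prop = queue.Queue(maxsize)
    # The final queue is unbounded so the stages never block on the consumer,
    # which only starts draining after all frames have been submitted.
    q_out = queue.Queue()

    stages = [
        (estimate, q_in, q_est),     # stand-in for motion estimation
        (propagate, q_est, q_prop),  # stand-in for motion propagation
        (compensate, q_prop, q_out), # stand-in for motion compensation
    ]
    threads = [
        threading.Thread(target=run_stage, args=stage, daemon=True)
        for stage in stages
    ]
    for t in threads:
        t.start()

    for frame in frames:
        q_in.put(frame)
    q_in.put(_STOP)

    results = []
    while True:
        item = q_out.get()
        if item is _STOP:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Toy stage functions on integers, purely to show the data flow.
    print(run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2, lambda x: x - 3))
```

Because each stage runs in a single thread and the queues are first-in, first-out, frame order is preserved without extra bookkeeping. A real stabilizer would presumably keep state inside the stages (for example, the trajectory buffer mentioned in the caption), which this sketch omits.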