3D Multi-frame Fusion for Video Stabilization

Zhan Peng, Xinyi Ye, Weiyue Zhao, Tianqi Liu, Huiqiang Sun, Baopu Li, Zhiguo Cao

TL;DR

RStab tackles the challenge of stabilizing videos while maintaining full-frame coverage and geometric structure. It introduces Stabilized Rendering (SR), a 3D volume-rendering fusion of multiple frames, augmented by Adaptive Ray Range (ARR) using depth priors and Color Correction (CC) with optical-flow-guided color aggregation. The method leverages epipolar constraints and depth-aware ray sampling to reduce dynamics-induced artifacts and recover occluded content. Experiments on NUS, Selfie, and DeepStab demonstrate state-of-the-art performance in field of view, image quality, and stability across datasets, with robust handling of dynamic regions and parallax.

Abstract

In this paper, we present RStab, a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods, we introduce a 3D multi-frame perspective to generate stabilized images, addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR), a volume rendering module that fuses multi-frame information in 3D space and extends beyond image fusion by incorporating feature fusion. Specifically, SR warps features and colors from multiple frames by projection and fuses them into descriptors to render the stabilized image. However, the precision of the warped information depends on the projection accuracy, which is strongly affected by dynamic regions. In response, we introduce the Adaptive Ray Range (ARR) module, which integrates depth priors to adaptively define the sampling range for the projection process. Additionally, we propose Color Correction (CC), which assists geometric constraints with optical flow for accurate color aggregation. Thanks to these three modules, RStab demonstrates superior performance compared with previous stabilizers in field of view (FOV), image quality, and video stability across various datasets.
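
As a rough illustration of the rendering step described in the abstract, the following Python sketch composites samples along a single ray with the standard volume-rendering quadrature, after narrowing the sampling range with a depth prior. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the "mean plus or minus k standard deviations" rule for the range, and the toy densities and colors are all hypothetical. In RStab, the per-sample densities and colors would presumably be predicted from the fused multi-frame descriptors rather than supplied by hand.

```python
import math


def ray_range_from_depth_prior(depths, k=2.0, eps=1e-3):
    """Hypothetical range rule: [mean - k*std, mean + k*std] of the depth priors."""
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    near = max(eps, mean - k * std)        # keep the range in front of the camera
    far = max(near + eps, mean + k * std)  # keep the range non-empty
    return near, far


def sample_depths(near, far, n_samples):
    """Evenly spaced sample depths in [near, far] along one ray."""
    if n_samples == 1:
        return [0.5 * (near + far)]
    step = (far - near) / (n_samples - 1)
    return [near + i * step for i in range(n_samples)]


def composite(densities, colors, depths):
    """Alpha-composite samples front to back (standard volume-rendering quadrature)."""
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for i, (sigma, rgb) in enumerate(zip(densities, colors)):
        # Interval to the next sample; the last sample reuses the previous interval.
        if i + 1 < len(depths):
            delta = depths[i + 1] - depths[i]
        else:
            delta = depths[i] - depths[i - 1] if i > 0 else 1.0
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Toy example: depth priors gathered for one ray (assumed values).
near, far = ray_range_from_depth_prior([2.0, 2.2, 1.8, 2.0])
ray_depths = sample_depths(near, far, 4)
# An opaque red sample at the front hides everything behind it.
toy_densities = [1e9, 1.0, 1.0, 1.0]
toy_colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
pixel, weights = composite(toy_densities, toy_colors, ray_depths)
print(pixel, weights)
```

Restricting the samples to a range derived from depth priors concentrates them near the likely surface, which is the intuition behind an adaptive ray range; the specific rule used here is only one plausible choice.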

Paper Structure (12 sections, 9 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Existing dilemmas and our method. (a) and (b) exhibit cropping issues, characteristic of single-frame methods. (a) and (c) have difficulty preserving structure, a limitation inherent in 2D-based approaches. Our proposed method (d) mitigates distortion and artifacts while producing full-frame stabilized output without cropping.
  • Figure 2: Overview of our framework. (1) Given input frames ${\{\mathbf{I}_t\}}_{t=1}^{N}$ with a shaky trajectory ${\{\mathbf{P}_t\}}_{t=1}^{N}$, our goal is to render a stabilized video sequence ${\{\tilde{\mathbf{I}}_t\}}_{t=1}^{N}$ with a smoothed trajectory ${\{\tilde{\mathbf{P}}_t\}}_{t=1}^{N}$. Here, the input trajectory ${\{\mathbf{P}_t\}}_{t=1}^{N}$ is obtained from preprocessing, while the smoothed trajectory ${\{\tilde{\mathbf{P}}_t\}}_{t=1}^{N}$ is generated by a Trajectory Smoothing module. (2) In addition to ${\{\mathbf{P}_t\}}_{t=1}^{N}$, depth maps ${\{\mathbf{D}_t\}}_{t=1}^{N}$ and optical flow $\{\mathbf{F}_{t}\}_{t=1}^{N}$ are obtained during preprocessing. We aggregate ${\{\mathbf{D}_t\}}_{t=1}^{N}$ into the ray range ${\{\tilde{\mathbf{R}}_{t}\}}_{t=1}^{N}$ using the Adaptive Ray Range module. The ray range ${\{\tilde{\mathbf{R}}_{t}\}}_{t=1}^{N}$, the optical flow $\{\mathbf{F}_{t}\}_{t=1}^{N}$, and the smoothed trajectory ${\{\tilde{\mathbf{P}}_t\}}_{t=1}^{N}$ serve as inputs to the Stabilized Rendering module. Using Stabilized Rendering, enhanced by the Color Correction module, we fuse the input frames ${\{\mathbf{I}_t\}}_{t=1}^{N}$ and their features ${\{\bm{\mathcal{F}}_t\}}_{t=1}^{N}$ to render the stabilized video sequence ${\{\tilde{\mathbf{I}}_t\}}_{t=1}^{N}$.
  • Figure 3: Illustration of depth projection and splatting. Left: Depth projection involves lifting a pixel $\mathbf{x}_t$ to 3D space using the estimated depth $\mathbf{D}_t(\mathbf{x}_t)$ and projecting it to the sub-pixel position $\tilde{\mathbf{x}}$. The depth of $\tilde{\mathbf{x}}$ can then be calculated and is denoted as $\tilde{\mathbf{D}}_t(\tilde{\mathbf{x}})$. Right: As $\tilde{\mathbf{x}}$ is not projected exactly onto a pixel coordinate, we distribute its depth to adjacent pixels, e.g., $\tilde{\mathbf{x}}_p$, with a distance-associated weight $\omega_t$.
  • Figure 4: The effect of temporal weights. The introduction of temporal weights can mitigate distortion.
  • Figure 5: Illustration of the Color Correction module. First, we project a pixel $\tilde{\mathbf{x}}_{T}$ from the target stabilized frame onto the corresponding pixel ${\mathbf{x}}_{T}$ of the input frame at the same timestamp $T$. Second, we find the match of ${\mathbf{x}}_{T}$ in the input frame at timestamp $t$ using the optical flow $\mathbf{F}_{{T}\rightarrow t}(\mathbf{x}_{T})$. As geometric constraints alone are insufficient for modeling dynamic regions, we aggregate precise color by correcting the geometrically projected position $\mathbf{x}_t$ to the optical-flow-refined position $\mathbf{x}'_{t}$.
  • ...and 4 more figures
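
The depth projection and splatting described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the paper's implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a known rigid transform `(rot, trans)` from the input camera to the stabilized camera, and bilinear weights as the distance-associated weight. The function names `project_depth` and `splat_depth` and all numeric values are hypothetical.

```python
import math


def project_depth(u, v, depth, intr, rot, trans):
    """Lift pixel (u, v) to 3D with its depth, move it to the target camera, reproject."""
    fx, fy, cx, cy = intr
    # Back-project to a 3D point in the source camera frame.
    p = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Rigid transform into the target (stabilized) camera frame.
    q = [sum(rot[r][c] * p[c] for c in range(3)) + trans[r] for r in range(3)]
    # Perspective projection; q[2] is the depth seen from the target camera.
    # Callers should discard points with q[2] <= 0 (behind the camera).
    return fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2]


def splat_depth(points, width, height):
    """Distribute sub-pixel depths to the four nearest pixels with bilinear weights."""
    acc = [[0.0] * width for _ in range(height)]
    wsum = [[0.0] * width for _ in range(height)]
    for x, y, d in points:
        x0, y0 = math.floor(x), math.floor(y)
        for px, py in ((x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)):
            if 0 <= px < width and 0 <= py < height:
                # Weight falls off linearly with distance to the pixel center.
                w = (1.0 - abs(x - px)) * (1.0 - abs(y - py))
                acc[py][px] += w * d
                wsum[py][px] += w
    # Normalize by accumulated weight; None marks pixels that received no depth.
    return [
        [acc[r][c] / wsum[r][c] if wsum[r][c] > 0 else None for c in range(width)]
        for r in range(height)
    ]


# Toy example: identity rotation and a small sideways shift of the camera.
intrinsics = (100.0, 100.0, 2.0, 2.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [0.02, 0.0, 0.0]
u_t, v_t, d_t = project_depth(2.0, 2.0, 4.0, intrinsics, identity, shift)
depth_map = splat_depth([(u_t, v_t, d_t)], width=5, height=5)
print((u_t, v_t, d_t), depth_map[2])
```

In this toy case the pixel lands halfway between two pixel centers, so its depth is shared equally between them. Normalizing by the accumulated weight keeps the splatted depth map meaningful when several source pixels land near the same target pixel.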