SpatioTemporal Difference Network for Video Depth Super-Resolution

Zhengxue Wang, Yuan Wu, Xiang Li, Zhiqiang Yan, Jian Yang

TL;DR

This work tackles depth video super-resolution by addressing long-tailed distributions in both spatial non-smooth regions and temporal variation zones. It introduces STDNet, a two-branch network that learns spatial and temporal difference representations to guide RGB-D feature alignment and multi-frame fusion, complemented by a difference regularization loss. The spatial difference branch aligns RGB features to depth non-smooth regions, while the temporal difference branch propagates information from adjacent RGB-D frames to handle motion and temporal variability, yielding temporally consistent HR depth videos. Extensive experiments on TarTanAir, DyDToF, and DynamicReplica show state-of-the-art performance across multiple upscaling factors, with improved depth fidelity and temporal stability. The approach offers a principled way to leverage spatiotemporal long-tailed characteristics for robust depth video reconstruction with practical implications for 3D vision and AR/VR pipelines.

Abstract

Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.
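The spatial long-tail observation in the abstract can be illustrated with a toy example. The sketch below is not the authors' code: the scene, the average-pooling degradation, and the nearest-neighbour upsampling (a stand-in for bicubic) are all assumptions made so that the example runs with the Python standard library alone. It shows that the absolute difference between ground-truth depth and upsampled low-resolution depth is zero on smooth surfaces and concentrated in a small set of pixels around a depth edge.

```python
# Hypothetical illustration; every helper name here is invented for this sketch.

def make_scene(size=32, edge=10, near=1.0, far=5.0):
    """Toy GT depth: a near plane left of a vertical edge, a far plane right of it."""
    return [[near if x < edge else far for x in range(size)] for _ in range(size)]

def downsample(depth, s):
    """Average-pool by factor s (assumed degradation model, not the paper's)."""
    h, w = len(depth), len(depth[0])
    return [
        [
            sum(depth[y * s + dy][x * s + dx] for dy in range(s) for dx in range(s)) / (s * s)
            for x in range(w // s)
        ]
        for y in range(h // s)
    ]

def upsample_nearest(depth, s):
    """Nearest-neighbour upsampling by factor s (stand-in for bicubic)."""
    h, w = len(depth), len(depth[0])
    return [[depth[y // s][x // s] for x in range(w * s)] for y in range(h * s)]

def abs_difference(a, b):
    """Per-pixel absolute difference between two equally sized depth maps."""
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def histogram(values, bins, vmax):
    """Count values into equal-width bins over [0, vmax]."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v / vmax * bins), bins - 1)
        counts[idx] += 1
    return counts

scale = 4
gt = make_scene()
upsampled = upsample_nearest(downsample(gt, scale), scale)
diff = abs_difference(gt, upsampled)

flat = [v for row in diff for v in row]
nonzero_fraction = sum(v > 1e-9 for v in flat) / len(flat)
hist = histogram(flat, bins=4, vmax=max(flat))

print(f"pixels with non-zero error: {nonzero_fraction:.1%}")
print("histogram of |GT - upsampled|:", hist)
```

Most pixels land in the first histogram bin and only the band straddling the edge carries error, which is the kind of skew the abstract attributes to spatial non-smooth regions.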

Paper Structure

This paper contains 16 sections, 13 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Quantitative comparisons between our STDNet and previous state-of-the-art methods on TarTanAir dataset.
  • Figure 2: Visualization of (a) RGB and GT depth at frame $t$; (b) absolute difference representations (top) and the corresponding histogram distribution (bottom) between GT depth and bicubic-upsampled LR depth (Bic.); (c) error analysis between the bicubic-upsampled LR depth of consecutive frames $t$ and $t-1$; (d) cross-frame results between frames $t$ and $t-2$.
  • Figure 3: Overview of STDNet. Given $\boldsymbol D_{LR}$, we first predict its spatial difference representation $\boldsymbol \sigma$. Then, $\boldsymbol D_{LR}$, $\boldsymbol I$, and $\boldsymbol \sigma$ are jointly fed into the spatial difference to enhance non-smooth regions, producing $\boldsymbol F_{sd}$. Next, we estimate the temporal difference representations for consecutive frames and cross frames, generating $\boldsymbol \varphi$ and $\widehat{\boldsymbol \varphi}$. These difference representations are used to propagate adjacent RGB and depth frames to the current depth frame, generating HR depth video $\boldsymbol D_{HR}$. Finally, a degradation regularization takes $\boldsymbol D_{HR}$, $\boldsymbol D_{GT}$, $\boldsymbol \sigma$, $\boldsymbol \varphi$, and $\widehat{\boldsymbol \varphi}$ as inputs to optimize the learning of spatiotemporal difference representations.
  • Figure 4: Details of (a) spatial difference, and (b) histogram comparison between our STDNet and DVSR (Sun et al., 2023).
  • Figure 5: Details of (a) temporal difference, and (b) temporal consistency visualization for $x$–$t$ slices (along dashed line). Diff.: Difference. Orange rectangular boxes are the deformable convolutional layers (Zhu et al., 2019).
  • ...and 5 more figures
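
The consecutive-frame and cross-frame comparisons described in the Figure 2 caption can be sketched in the same spirit. The snippet below is a hypothetical illustration rather than the paper's procedure: it builds three toy depth frames in which a near box moves one pixel per frame over a far background (an assumed motion model), then measures the per-pixel absolute difference between frame $t$ and frames $t-1$ and $t-2$.

```python
# Hypothetical illustration; the scene and helper names are invented for this sketch.

def make_frame(t, size=16, box=4, top=4, near=1.0, far=5.0):
    """Toy depth frame: a near box whose left edge sits at column t, over a far background."""
    return [
        [near if (top <= y < top + box and t <= x < t + box) else far for x in range(size)]
        for y in range(size)
    ]

def temporal_difference(current, reference):
    """Per-pixel absolute depth difference between two frames."""
    return [[abs(c - r) for c, r in zip(rc, rr)] for rc, rr in zip(current, reference)]

def changed_fraction(diff, eps=1e-9):
    """Fraction of pixels whose depth changed between the two frames."""
    flat = [v for row in diff for v in row]
    return sum(v > eps for v in flat) / len(flat)

frames = [make_frame(t) for t in range(3)]  # frames t-2, t-1, t

consecutive = temporal_difference(frames[2], frames[1])  # t vs t-1
cross = temporal_difference(frames[2], frames[0])        # t vs t-2

print(f"changed pixels, t vs t-1: {changed_fraction(consecutive):.2%}")
print(f"changed pixels, t vs t-2: {changed_fraction(cross):.2%}")
```

In this toy setting the static background contributes nothing, and the changed region grows with the frame gap but still covers only a small share of the image, mirroring the temporal variation zones the caption refers to.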