Trajectory Densification and Depth from Perspective-based Blur

Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu, Yueting Chen

TL;DR

The paper tackles metric depth estimation and dense camera-trajectory reconstruction from perspective-based blur in monocular video captured without a stabilizer. It introduces a joint optical-depth pipeline that uses DINOv2 features and CoTracker to extract video information, a Transformer-based depth estimator with window embedding, and a vision-language dense trajectory decoder. Training proceeds in two stages (depth, then trajectory). Evaluations on indoor, outdoor, and synthetic datasets show state-of-the-art depth accuracy and substantially denser trajectory reconstruction than traditional SfM approaches. The approach advances monocular video understanding by extracting metric depth and dense motion cues from long-exposure blur, with potential impact on stabilization, AR, and robotics.

Abstract

In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capture, which induces perspective-based blur, especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream, and recovers a dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.

Paper Structure

This paper contains 13 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our method predicts a metric depth map and a dense within-frame trajectory from perspective-based blur, which is caused by the camera motion shown on the curve at the upper left. In the trajectory densification part, the left column depicts the computed sparse trajectory, whereas the right column shows the predicted dense one; the blue solid line denotes the ground truth (GT).
  • Figure 2: An ideal optical system. The red point is an on-axis point and the blue points denote off-axis points. Under an identical camera state, scene points at different spatial locations exhibit different image deltas.
  • Figure 3: Overview of our pipeline. We begin by extracting multi-frame features with the DINOv2 model, while employing an off-the-shelf point tracker (CoTracker) to derive $\Delta$ of $N$ query points, from which we compute the sparse trajectory $\Theta$ via an optics-based algorithm. (a) Depth estimation. We segment the $T$-length features into window-size segments and further encode them with self-attention, followed by aggregation into the first window via cross-attention. (b) Dense trajectory decoder. Multi-frame features fused with the depth map $L$ are injected into the tokenized $\Theta$ (DeBERTa), resulting in the dense camera trajectory.
  • Figure 4: (a) Window-embed. Multi-frame features are windowed, concatenated along the channel dimension, and embedded with a convolutional layer. (b) Output head. The head first reduces channels by half using a convolution, then upsamples via bilinear interpolation to the original resolution, and finally employs two convolution–activation stages to produce the depth map. (c) Cross-Window. Cross-attention is computed between two windows, where the refined post-window serves as the key–value to refine the pre-window.
  • Figure 5: Results of our pipeline, including depth estimation and trajectory reconstruction. Here, samples denotes the number of samples within a single frame of the reconstructed dense trajectory. In the 3D plot, we render the GT trajectory as a blue solid line, depict the sparse trajectory with red markers, and use distinct colors to represent the reconstructed trajectory for each individual frame.
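
The windowing step described in the Figure 4(a) caption can be illustrated with a minimal sketch. It only covers the "windowed, concatenated along the channel dimension" part; the convolutional embedding that follows is omitted. The function name `window_concat`, the use of non-overlapping windows, and the requirement that $T$ be divisible by the window size are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

# One frame's feature vector with C channels. This is a toy stand-in for a
# per-frame feature map; spatial dimensions are dropped to keep the sketch small.
Feature = List[float]


def window_concat(features: List[Feature], window_size: int) -> List[Feature]:
    """Group T per-frame features into windows and concatenate each window's
    frames along the channel dimension.

    Assumptions (hypothetical, not from the paper): windows do not overlap,
    and T is an exact multiple of window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if len(features) % window_size != 0:
        raise ValueError("T must be divisible by window_size in this sketch")

    windows: List[Feature] = []
    for start in range(0, len(features), window_size):
        concatenated: Feature = []
        # Append the channels of each frame in the window, in temporal order.
        for frame in features[start:start + window_size]:
            concatenated.extend(frame)
        windows.append(concatenated)
    return windows


if __name__ == "__main__":
    # T = 6 frames with C = 2 channels each, grouped into windows of 3 frames.
    frames = [[float(t), float(t) + 0.5] for t in range(6)]
    for index, window in enumerate(window_concat(frames, 3)):
        print(index, len(window), window)
```

With $T = 6$ frames of $C = 2$ channels and a window size of 3, the sketch yields 2 windows of 6 channels each, which is the shape a subsequent embedding layer would consume.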