Seurat: From Moving Points to Depth

Seokju Cho, Jiahui Huang, Seungryong Kim, Joon-Young Lee

TL;DR

This work tackles monocular depth estimation in videos by exploiting the temporal evolution of 2D point trajectories. It introduces Seurat, a two-branch Transformer framework that uses a dense grid of supporting trajectories to inform depth prediction for query points, with cross-attention enabling global motion context. Depth ratios are learned within sliding windows using a window-wise log-ratio loss and are fused with a metric-depth model to yield metric depths. On TAPVid-3D, Seurat achieves temporally smooth, high-accuracy depth across diverse domains, demonstrating strong zero-shot generalization from synthetic data to real-world video without stereo or multi-view data.
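The summary above names a window-wise log-ratio loss and a fusion step with a metric-depth model, but gives neither formula. As a rough sketch only, the Python below shows one plausible reading: a loss that compares log depth ratios relative to the first frame of a window, and a fusion that picks a single log-space least-squares scale per window. The function names `log_ratio_loss` and `fuse_with_metric`, the L1 form of the loss, and the choice of reference frame are all assumptions made for illustration, not the authors' definitions.

```python
import math

def log_ratio_loss(pred_depths, true_depths):
    """Hypothetical window-wise loss (assumed form, not from the paper).

    Each sequence is converted to log depth ratios relative to its own
    first frame, so any global scale factor cancels out.
    """
    pred = [math.log(d / pred_depths[0]) for d in pred_depths]
    true = [math.log(d / true_depths[0]) for d in true_depths]
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def fuse_with_metric(ratios, metric_depths):
    """Assumed fusion rule: one scale per window, fitted in log space.

    Minimizing sum((log(s * r_t) - log(m_t))^2) over s gives
    log(s) = mean(log(m_t) - log(r_t)).
    """
    log_scale = sum(
        math.log(m) - math.log(r) for r, m in zip(ratios, metric_depths)
    ) / len(ratios)
    scale = math.exp(log_scale)
    return [scale * r for r in ratios]

# Smooth depth ratios for one window (made-up numbers) and a noisier
# per-frame metric estimate from some separate depth model.
ratios = [1.0, 1.1, 1.25, 1.4]
noisy_metric = [2.1, 2.1, 2.6, 2.7]
fused = fuse_with_metric(ratios, noisy_metric)
```

Under these assumptions the fused sequence keeps the frame-to-frame ratios of the trajectory-based prediction and takes only its overall scale from the metric model, which is one way the temporal smoothness described above could carry over to metric depth.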

Abstract

Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.

Paper Structure

This paper contains 26 sections, 11 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Seurat predicts precise and smooth depth changes for dynamic objects over time by looking only at the 2D point trajectories, which encode depth cues in their motion patterns. The figure illustrates 2D point tracks lifted into 3D space with our depth predictions on videos from the DAVIS dataset [pont20172017].
  • Figure 2: Motivation of our work. (a) By looking only at the tracked points, we can easily perceive that the object (here, a car) is moving away. (b) As a 3D object (here, a sphere) moves away from the camera, the pattern of its projected 2D points on the image plane changes, providing depth cues. In the initial frame (left), points are spaced farther apart on the image plane. As the object recedes (right), these 2D points converge toward the center, indicating increasing depth. This change in the density of projected points allows relative depth changes to be inferred from motion in monocular video.
  • Figure 3: Overall architecture. We first use off-the-shelf point trackers [karaev2024cotracker, cho2024local] to extract 2D trajectories of the query points and of a dense supporting grid. These trajectories are then processed by a temporal and a spatial transformer in two separate branches. Motion information encoded by the supporting branch is injected into the query branch via cross-attention. Finally, two regression heads output the depth ratios of the supporting and query trajectories.
  • Figure 4: Qualitative comparisons to baselines. We visualize 3D trajectories using the TAPVid-3D benchmark [koppula2024tapvid]. Compared to baselines that use combinations such as CoTracker with ZoeDepth, our model achieves superior depth smoothness and accuracy.
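
The geometric cue described in Figure 2(b) can be checked with a few lines of pinhole-camera arithmetic. The sketch below is an illustration only, not part of Seurat: it assumes an ideal pinhole camera with unit focal length, and it replaces the sphere with a flat ring of points facing the camera so the numbers stay simple. The helper names `project`, `mean_spread`, and `ring` are made up for this example.

```python
import math

def project(point, focal=1.0):
    """Pinhole projection of a 3D point (X, Y, Z) onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def mean_spread(points_2d):
    """Mean distance of a set of 2D points from their centroid."""
    n = len(points_2d)
    cx = sum(p[0] for p in points_2d) / n
    cy = sum(p[1] for p in points_2d) / n
    return sum(math.hypot(p[0] - cx, p[1] - cy) for p in points_2d) / n

def ring(radius, depth, count=8):
    """Points on a camera-facing ring of the given radius at the given depth."""
    return [
        (
            radius * math.cos(2 * math.pi * k / count),
            radius * math.sin(2 * math.pi * k / count),
            depth,
        )
        for k in range(count)
    ]

# The same rigid ring observed at depth 2 and then at depth 4.
near = [project(p) for p in ring(radius=1.0, depth=2.0)]
far = [project(p) for p in ring(radius=1.0, depth=4.0)]

# The image-plane spread shrinks by the same factor the depth grows.
spread_ratio = mean_spread(near) / mean_spread(far)
```

In this simplified setting the spread ratio equals the depth ratio (4 / 2), so the relative depth change is recoverable from the 2D points alone. Real scenes add rotation, non-rigid motion, and tracking noise, which this toy example ignores.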