Track Everything Everywhere Fast and Robustly

Yunzhou Song; Jiahui Lei; Ziyun Wang; Lingjie Liu; Kostas Daniilidis

TL;DR

CaDeX++ is a novel invertible deformation network that factorizes its function representation into a local spatial-temporal feature grid and makes the coupling blocks more expressive with non-linear functions, improving both efficiency and robustness.

Abstract

We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. Moreover, OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and tracking accuracy over the SoTA optimization-based method OmniMotion.
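To make the idea of an invertible deformation built from coupling blocks concrete, here is a minimal sketch in plain Python. It is not the CaDeX++ architecture: it swaps in a simple affine coupling whose scale and shift come from a hand-written toy function, whereas the paper describes non-linear couplings conditioned on a local spatial-temporal feature grid. All names (`toy_scale_shift`, `coupling_forward`, `track`, and so on) are hypothetical and chosen only for illustration.

```python
import math

def toy_scale_shift(others, t):
    """Hypothetical stand-in for a learned network: maps the two untouched
    coordinates and the time t to a (log-scale, shift) pair."""
    a, b = others
    log_scale = 0.3 * math.tanh(a - b + t)
    shift = 0.5 * math.sin(a + 2.0 * b) + 0.1 * t
    return log_scale, shift

def coupling_forward(point, t, dim):
    """Change only coordinate `dim`; the other two pass through unchanged."""
    others = tuple(v for i, v in enumerate(point) if i != dim)
    log_scale, shift = toy_scale_shift(others, t)
    out = list(point)
    out[dim] = point[dim] * math.exp(log_scale) + shift
    return tuple(out)

def coupling_inverse(point, t, dim):
    """Exact inverse: the untouched coordinates give the same scale and shift."""
    others = tuple(v for i, v in enumerate(point) if i != dim)
    log_scale, shift = toy_scale_shift(others, t)
    out = list(point)
    out[dim] = (point[dim] - shift) * math.exp(-log_scale)
    return tuple(out)

def deform_to_canonical(point, t, dims=(0, 1, 2)):
    """Stack of blocks, each changing one coordinate dimension."""
    for d in dims:
        point = coupling_forward(point, t, d)
    return point

def deform_from_canonical(point, t, dims=(0, 1, 2)):
    """Undo the stack by inverting the blocks in reverse order."""
    for d in reversed(dims):
        point = coupling_inverse(point, t, d)
    return point

def track(point_i, t_i, t_j):
    """Map a 3D point from time t_i to time t_j through the canonical space."""
    return deform_from_canonical(deform_to_canonical(point_i, t_i), t_j)
```

Because every block leaves its conditioning coordinates untouched, the inverse is available in closed form, so tracking from frame i to frame j and back returns the starting point up to floating-point error. That property holds for any choice of scale-and-shift function, which is what lets the blocks be made more expressive without losing invertibility.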

Paper Structure (24 sections, 12 equations, 9 figures, 5 tables)

This paper contains 24 sections, 12 equations, 9 figures, 5 tables.

Introduction
Related Work
Method
Preliminaries
CaDeX++: Non-linear and Local Invertible NVPs
Optimization with Depth Prior
Incorporation of Long-term Semantics
Training and Inference
Experiments
Experiment Setup
Comparison with SoTA Methods
Baselines
Quantitative comparisons
Qualitative comparison
Ablation Study
...and 9 more sections

Figures (9)

Figure 1: Our optimization-based approach achieves fast and robust long-term tracking
Figure 2: Method Overview: To track a query pixel $p_i$, we first lift the pixel to 3D with an optimizable depth map (Sec. \ref{sec:method_depth}). The 3D point is deformed into the shared canonical space and back to another time frame $j$ with a novel efficient and expressive invertible deformation field $\mathcal{T}$ (Sec. \ref{sec:method_local_nvp}). The depth maps and the deformation $\mathcal{T}$ are optimized with both short-term dense RAFT \cite{teed2020raft} optical flow and long-term sparse DINOv2 \cite{oquab2023dinov2} correspondence (Sec. \ref{sec:method_dino}).
Figure 3: Architecture of CaDeX++ (right). The deformation network has a stack of coupling blocks and gradually changes one coordinate dimension per block (for the differences, see Sec. \ref{sec:method_local_nvp}).
Figure 4: Filtered long-range semantic correspondences based on DINOv2 \cite{oquab2023dinov2}.
Figure 5: We compare the tracking performance of our method with TAPIR \cite{doersch2023tapir}, CoTracker \cite{karaev2023cotracker} and OmniMotion \cite{wang2023tracking} on the DAVIS scenes dogs-jump, bmx-trees, and parkour (from top to bottom). The leftmost column shows the initial query points. Our method performs better on these scenes than the other methods.
...and 4 more figures
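
The pipeline summarized in the Figure 2 caption (lift a pixel to 3D with its depth, deform to the canonical space, deform back at another time, then read off the new pixel) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the pinhole camera with `focal`, `cx`, `cy` is an assumption not given in the text above, and `toy_deform` is a hypothetical time-dependent shift standing in for the learned deformation $\mathcal{T}$.

```python
def lift_pixel(u, v, depth, focal, cx, cy):
    """Back-project pixel (u, v) with its depth to a 3D point.
    Assumes a pinhole camera; the source text does not state the camera model."""
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)

def project_point(point, focal, cx, cy):
    """Project a 3D point back to pixel coordinates (same assumed pinhole camera)."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)

def toy_deform(point, t):
    """Hypothetical invertible map from time t to the canonical space:
    a simple time-dependent shift, used only as a placeholder."""
    x, y, z = point
    return (x - 0.1 * t, y + 0.05 * t, z)

def toy_deform_inverse(point, t):
    """Inverse of toy_deform: canonical space back to time t."""
    x, y, z = point
    return (x + 0.1 * t, y - 0.05 * t, z)

def track_pixel(u, v, depth, t_i, t_j, focal=500.0, cx=320.0, cy=240.0):
    """Lift at time t_i, pass through the canonical space, land at time t_j."""
    point_i = lift_pixel(u, v, depth, focal, cx, cy)
    canonical = toy_deform(point_i, t_i)
    point_j = toy_deform_inverse(canonical, t_j)
    return project_point(point_j, focal, cx, cy)

# Usage: where does pixel (100, 200) with depth 2.0 at time 0 land at time 1?
u_j, v_j = track_pixel(100.0, 200.0, 2.0, 0.0, 1.0)
```

Routing every frame through one shared canonical space means a single deformation field serves all frame pairs, so a query at any time can be mapped to any other time without storing pairwise correspondences.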
