Track Everything Everywhere Fast and Robustly

Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis

TL;DR

A novel invertible deformation network is introduced, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions to improve efficiency and robustness.

Abstract

We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. It is also sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and tracking accuracy over the state-of-the-art optimization-based method OmniMotion.

Paper Structure (24 sections, 12 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Our optimization-based approach achieves fast and robust long-term tracking
  • Figure 2: Method Overview: To track a query pixel $p_i$, we first lift the pixel to 3D with an optimizable depth map (Sec. \ref{sec:method_depth}). The 3D point is deformed into the shared canonical space and back to another time frame $j$ with a novel, efficient, and expressive invertible deformation field $\mathcal{T}$ (Sec. \ref{sec:method_local_nvp}). The depth maps and the deformation $\mathcal{T}$ are optimized with both short-term dense RAFT \cite{teed2020raft} optical flow and long-term sparse DINOv2 \cite{oquab2023dinov2} correspondence (Sec. \ref{sec:method_dino}).
  • Figure 3: Architecture of CaDeX++ (right). The deformation network is a stack of coupling blocks and gradually changes one coordinate dimension per block (for the differences, see Sec. \ref{sec:method_local_nvp}).
  • Figure 4: Filtered long-range semantic correspondences based on DINOv2 \cite{oquab2023dinov2}.
  • Figure 5: We compare the tracking performance of our method with TAPIR \cite{doersch2023tapir}, CoTracker \cite{karaev2023cotracker}, and OmniMotion \cite{wang2023tracking} on the DAVIS scenes dogs-jump, bmx-trees, and parkour (top to bottom). The leftmost column shows the initial query points. Our method performs better on these scenes than the other methods.
  • ...and 4 more figures
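
The pipeline described in the Figure 2 and Figure 3 captions (lift a pixel to 3D, map it to a shared canonical space, map it back at another frame) can be illustrated with a minimal NumPy sketch. This is not the CaDeX++ architecture: it uses plain affine coupling blocks, and the `conditioner` below is a fixed toy function standing in for a learned network conditioned on a local feature grid. The pinhole camera model, the intrinsics `focal` and `center`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def conditioner(kept, t, k):
    """Toy stand-in for a learned network (hypothetical): maps the two
    untouched coordinates and the frame time t to (log-scale, shift)."""
    phase = kept[..., 0] + 2.0 * kept[..., 1] + 0.1 * t + k
    return 0.5 * np.tanh(np.sin(phase)), np.cos(phase)

def to_canonical(x, t, n_blocks=6):
    """Frame-t 3D point -> canonical space. Block k changes only
    coordinate k % 3; the other two coordinates pass through."""
    x = np.array(x, dtype=np.float64)  # copy, so the input is not modified
    for k in range(n_blocks):
        d = k % 3
        kept = np.delete(x, d, axis=-1)
        log_s, shift = conditioner(kept, t, k)
        x[..., d] = x[..., d] * np.exp(log_s) + shift
    return x

def from_canonical(u, t, n_blocks=6):
    """Exact inverse of to_canonical: undo the blocks in reverse order.
    The kept coordinates are unchanged by a block, so the same
    (log-scale, shift) can be recomputed and inverted in closed form."""
    u = np.array(u, dtype=np.float64)
    for k in reversed(range(n_blocks)):
        d = k % 3
        kept = np.delete(u, d, axis=-1)
        log_s, shift = conditioner(kept, t, k)
        u[..., d] = (u[..., d] - shift) * np.exp(-log_s)
    return u

def lift(pixel, depth, focal, center):
    """Back-project one pixel (u, v) with a depth value (pinhole model, assumed)."""
    xy = (np.asarray(pixel, dtype=np.float64) - np.asarray(center)) / focal * depth
    return np.append(xy, depth)

def project(point, focal, center):
    """Project a 3D point back to pixel coordinates (pinhole model, assumed)."""
    return point[:2] / point[2] * focal + np.asarray(center)

def track(pixel, depth_i, t_i, t_j, focal=500.0, center=(320.0, 240.0)):
    """Query pixel at frame t_i -> 3D -> canonical -> 3D at t_j -> pixel."""
    x_i = lift(pixel, depth_i, focal, center)
    x_j = from_canonical(to_canonical(x_i, t_i), t_j)
    return project(x_j, focal, center)
```

Because every block is inverted exactly, mapping a point to canonical space and back at the same frame returns the original point, so `track(p, d, t, t)` recovers `p` up to floating-point error. With different frame times the toy conditioner produces an arbitrary displacement. In the method described above, the deformation and depth maps would instead be optimized against optical flow and DINOv2 correspondences.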