Decomposition Betters Tracking Everything Everywhere

Rui Li, Dong Liu

TL;DR

DecoMotion tackles long-range pixel tracking by decoupling video into static and dynamic content using two canonical 3D volumes, $G_{\text{st}}$ and $G_{\text{dy}}$, with tailored affine and invertible transformations and a fusion step to produce $G_{\text{cb}}$. For dynamic objects, it augments color with discriminative temporal features via $f^{\text{dy}}$, while static scenes leverage a simpler affine model with a static confidence $\beta^{\text{st}}$; the two representations are fused to render motion and appearance. The optimization combines data preprocessing from optical flow and multi-loss objectives (flow, photometric, and feature rendering) to guide both components and their fusion. Evaluations on TAP-Vid show substantial improvements in point-tracking accuracy over baselines such as OmniMotion, with additional gains from feature rendering and appearance decomposition, indicating strong robustness to occlusion and deformation and enabling downstream tasks like video inpainting.

Abstract

Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging because a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, each of which is represented by a quasi-3D canonical volume. DecoMotion coordinates the transformations between local and canonical spaces separately for the two volumes, using an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations and also yields decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate that our method boosts the point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions.
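
To make the fusion-then-render idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that fusion is a per-sample convex blend weighted by the static confidence $\beta^{\text{st}}$, that densities are composited with standard volume-rendering weights, and that the projection is orthographic (the depth coordinate is dropped). The function names `alpha_composite_weights` and `render_correspondence` and all parameter names are hypothetical.

```python
import numpy as np

def alpha_composite_weights(sigma):
    """Volume-rendering weights for K samples ordered from near to far."""
    alpha = 1.0 - np.exp(-sigma)  # per-sample opacity
    # Transmittance: probability that the ray reaches sample k unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    return alpha * trans

def render_correspondence(x_st, x_dy, sigma_st, sigma_dy, beta_st):
    """Blend static/dynamic samples, then composite them into one 2D point.

    x_st, x_dy         : (K, 3) ray samples mapped into the target frame by the
                         static and dynamic transformations
    sigma_st, sigma_dy : (K,) densities read from the two canonical volumes
    beta_st            : (K,) static confidence in [0, 1]
    ASSUMPTION: the convex blend below is illustrative only.
    """
    b = beta_st[:, None]
    x_cb = b * x_st + (1.0 - b) * x_dy                           # fused 3D locations
    sigma_cb = beta_st * sigma_st + (1.0 - beta_st) * sigma_dy   # fused densities
    w = alpha_composite_weights(sigma_cb)
    x_hat = (w[:, None] * x_cb).sum(axis=0)                      # composited 3D point
    return x_hat[:2], w                                          # ASSUMPTION: orthographic projection
```

With `beta_st` set to all ones, the rendered point depends only on the static samples; with all zeros, only on the dynamic ones. Intermediate values interpolate between the two, which is the behavior the fusion step is meant to capture under the stated assumptions.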

Paper Structure (36 sections, 14 equations, 7 figures, 3 tables)

This paper contains 36 sections, 14 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the proposed decoupled representation. We explicitly define two separate 3D canonical volumes $G_{\text{dy}}$ and $G_{\text{st}}$ to characterize dynamic objects and static scenes in videos, respectively. In addition to color and density, $G_{\text{dy}}$ encodes the feature $f^{\text{dy}}$ to better represent dynamic objects, and a feature rendering loss is further proposed to rectify the non-rigid transformation. $G_{\text{st}}$ stores $\beta^{\text{st}}$, the confidence that a point is static. For each canonical volume, we carefully design transformation functions $\mathcal{T}^{\text{dy}}$ (solid line) and $\mathcal{T}^{\text{st}}$ (dashed line) to map each local 3D point $x_i^k$ along the ray of the point $p_i$ to $u_j^k, v_j^k$ in another local 3D volume $L_j$. To render the 2D correspondence $\hat{p}_j$ for $p_i$, we obtain the final canonical volume $G_{\text{cb}}$ by volume fusion with $\beta^{\text{st}}$. The sets of 3D points $\{u_j^k\}_{k=1}^K,\{v_j^k\}_{k=1}^K$ mapped from $\{x_i^k\}_{k=1}^K$ are aggregated by alpha compositing in $G_{\text{cb}}$ (see Eq. (\ref{eq:ax})) and projected to the image plane.
  • Figure 2: Illustration of the feature rendering loss. The last row shows qualitative results, with query points marked in different colors. Please refer to the corresponding video in the supplemental materials. (Zoom in for best view)
  • Figure 3: The rendering results for static scenes and dynamic objects. Even when optimized only with $\mathcal{L}^{\text{cb}}$, the model still decomposes both motion and appearance.
  • Figure 4: The ablation study for the transformations. $n_{\text{st}}$ and $n_{\text{dy}}$ denote the number of non-linear layers \cite{dinh2016density} used in the static and dynamic transformations, respectively.
  • Figure 5: Qualitative results for point tracking. Given the query points marked with different colors in the first frame (blue border), we visualize the visible correspondences in randomly sampled frames. Please refer to more videos in the supplemental materials. (Zoom in for best view)
  • ...and 2 more figures
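
The invertible transformations referenced in the TL;DR and in the Figure 4 ablation can be illustrated with a single layer. The sketch below assumes the cited non-linear layers are affine coupling layers in the style of Real NVP; this is an assumption, since the captions do not specify the architecture. The class name `AffineCoupling` is hypothetical, and the conditioning network is replaced by a toy fixed random map rather than a trained MLP.

```python
import numpy as np

class AffineCoupling:
    """One coupling layer on 3D points: (x, y) pass through, z is rescaled and shifted."""

    def __init__(self, rng):
        # ASSUMPTION: toy stand-in for a learned network mapping (x, y) -> (log-scale, shift).
        self.W = rng.normal(size=(2, 2)) * 0.1
        self.b = rng.normal(size=2) * 0.1

    def _scale_shift(self, kept):
        h = np.tanh(kept @ self.W + self.b)
        return h[..., 0], h[..., 1]

    def forward(self, x):
        kept, moved = x[..., :2], x[..., 2]
        log_s, t = self._scale_shift(kept)
        z = moved * np.exp(log_s) + t
        return np.concatenate([kept, z[..., None]], axis=-1)

    def inverse(self, y):
        kept, moved = y[..., :2], y[..., 2]
        log_s, t = self._scale_shift(kept)  # same values as in forward: kept is unchanged
        z = (moved - t) * np.exp(-log_s)
        return np.concatenate([kept, z[..., None]], axis=-1)
```

Because the conditioning coordinates pass through unchanged, the inverse recomputes the same scale and shift and undoes them exactly. Stacking several such layers, each with a different choice of pass-through coordinates, gives a more expressive bijection; under this reading, $n_{\text{st}}$ and $n_{\text{dy}}$ would be the number of stacked layers.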