Decomposition Betters Tracking Everything Everywhere
Rui Li, Dong Liu
TL;DR
DecoMotion tackles long-range pixel tracking by decoupling video into static and dynamic content using two canonical 3D volumes, $G_{\text{st}}$ and $G_{\text{dy}}$, with tailored affine and invertible transformations and a fusion step to produce $G_{\text{cb}}$. For dynamic objects, it augments color with discriminative temporal features via $f^{\text{dy}}$, while static scenes leverage a simpler affine model with a static confidence $\beta^{\text{st}}$; the two representations are fused to render motion and appearance. The optimization combines data preprocessing from optical flow and multi-loss objectives (flow, photometric, and feature rendering) to guide both components and their fusion. Evaluations on TAP-Vid show substantial improvements in point-tracking accuracy over baselines such as OmniMotion, with additional gains from feature rendering and appearance decomposition, indicating strong robustness to occlusion and deformation and enabling downstream tasks like video inpainting.
Abstract
Recent studies on motion estimation have advocated an optimized motion representation that is globally consistent across the entire video, preferably for every pixel. This is challenging because a uniform representation may not account for the complex and diverse motion and appearance of natural videos. We address this problem and propose a new test-time optimization method, named DecoMotion, for estimating per-pixel and long-range motion. DecoMotion explicitly decomposes video content into static scenes and dynamic objects, each represented by a quasi-3D canonical volume. DecoMotion coordinates the transformations between local and canonical spaces separately for each volume, using an affine transformation for the static scene that corresponds to camera motion. For the dynamic volume, DecoMotion leverages discriminative and temporally consistent features to rectify the non-rigid transformation. The two volumes are finally fused to fully represent motion and appearance. This divide-and-conquer strategy leads to more robust tracking through occlusions and deformations and also yields decomposed appearances. We conduct evaluations on the TAP-Vid benchmark. The results demonstrate that our method boosts the point-tracking accuracy by a large margin and performs on par with some state-of-the-art dedicated point-tracking solutions.
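To make the static-scene mapping concrete, the following is a minimal sketch of how a per-frame affine transformation can carry quasi-3D points from one frame's local space into a shared canonical space and back out into another frame. It is an illustration under stated assumptions, not the authors' implementation: the function names (`make_affine`, `apply_affine`, `map_between_frames`), the 4x4 homogeneous parameterization, and the convention that each matrix maps local to canonical coordinates are all hypothetical choices made for this example.

```python
import numpy as np


def make_affine(A, t):
    """Pack a 3x3 linear part A and a translation t into a 4x4 homogeneous matrix.

    Assumption: each frame has one such matrix mapping local -> canonical space.
    """
    M = np.eye(4)
    M[:3, :3] = np.asarray(A, dtype=float)
    M[:3, 3] = np.asarray(t, dtype=float)
    return M


def apply_affine(M, pts):
    """Apply a 4x4 affine matrix to an (N, 3) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:3, :3].T + M[:3, 3]


def map_between_frames(M_i, M_j, pts_i):
    """Map points from frame i to frame j by passing through the canonical space.

    Frame i's matrix sends the points to canonical coordinates; the inverse of
    frame j's matrix sends them on to frame j's local coordinates.
    """
    canonical = apply_affine(M_i, pts_i)
    return apply_affine(np.linalg.inv(M_j), canonical)


# Hypothetical frames: frame i is a pure translation, frame j a uniform scaling.
M_i = make_affine(np.eye(3), [1.0, 0.0, 0.0])
M_j = make_affine(2.0 * np.eye(3), [0.0, 0.0, 0.0])

pts_i = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
pts_j = map_between_frames(M_i, M_j, pts_i)
pts_back = map_between_frames(M_j, M_i, pts_j)
```

Because every affine matrix here is invertible, mapping from frame i to frame j and back recovers the original points, which is the property that makes correspondences consistent across the whole video. The dynamic volume would need a non-rigid invertible mapping in place of the affine one, which this sketch does not attempt.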
