DINTR: Tracking via Diffusion-based Interpolation

Pha Nguyen, Ngan Le, Jackson Cothren, Alper Yilmaz, Khoa Luu

TL;DR

This work introduces DINTR, a diffusion-based framework for object tracking that operates in the visual domain and supports multiple indicator representations. It develops two diffusion-based temporal mechanisms (deterministic next-frame reconstruction and a faster interpolation-based approach) that model frame-to-frame correspondences while injecting target indications via conditioning. The method unifies tracking across point, box, segment, and text indications and demonstrates strong multiplicity across seven benchmarks. Results are competitive or state-of-the-art on several tasks, highlighting the potential of diffusion-based visual tracking with open-ended conditioning.

Abstract

Object tracking is a fundamental task in computer vision, requiring the localization of objects of interest across video frames. Diffusion models have shown remarkable capabilities in visual generation, making them well-suited for addressing several requirements of the tracking problem. This work proposes a novel diffusion-based methodology to formulate the tracking task. First, the conditional process of diffusion models allows indications of the target object to be injected into the generation process. Second, diffusion mechanics can be developed to inherently model temporal correspondences, enabling the reconstruction of actual frames in video. However, existing diffusion models rely on an extensive and unnecessary mapping to a Gaussian noise domain, which can be replaced by a more efficient and stable interpolation process. Our proposed interpolation mechanism draws inspiration from classic image-processing techniques, offering a more interpretable, stable, and faster approach tailored specifically to the object tracking task. By leveraging the strengths of diffusion models while circumventing their limitations, our Diffusion-based INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves a superior multiplicity on seven benchmarks across five indicator representations.
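To make the interpolation idea concrete, the sketch below blends two consecutive frames linearly, in the spirit of the classic image-processing techniques the abstract mentions. It is only an illustration under stated assumptions: the abstract does not give DINTR's actual interpolation formula, and the names `Frame`, `blend`, and `interpolation_path`, the flattened-list frame representation, and the evenly spaced schedule for `t` are all hypothetical.

```python
from typing import List

# Assumption: a frame is a flattened list of pixel intensities in [0, 1].
Frame = List[float]


def blend(frame_a: Frame, frame_b: Frame, t: float) -> Frame:
    """Linear interpolation: returns frame_a at t = 0 and frame_b at t = 1."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    return [(1.0 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


def interpolation_path(frame_a: Frame, frame_b: Frame, num_steps: int) -> List[Frame]:
    """Return num_steps + 1 frames, evenly spaced in t over [0, 1]."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [blend(frame_a, frame_b, i / num_steps) for i in range(num_steps + 1)]


# Two toy "frames" standing in for consecutive video frames.
frame_k = [0.0, 0.2, 1.0, 0.5]
frame_k1 = [1.0, 0.2, 0.0, 0.5]
path = interpolation_path(frame_k, frame_k1, num_steps=4)
```

Every intermediate frame in `path` stays in the image domain, which is the contrast the abstract draws with processes that first map the frame to Gaussian noise.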

Paper Structure

This paper contains 25 sections, 26 equations, 8 figures, 10 tables, 5 algorithms.

Figures (8)

  • Figure 1: Diffusion-based processes. (a) Probabilistic diffusion process [ho2020denoising], where $q(\cdot)$ is noise sampling and $p_\theta(\cdot)$ is denoising. (b) Diffusion process in the 2D coordinate space [chen2023diffusiondet, luo2023diffusiontrack, lv2024diffmot]. (c) A purely visual diffusion-based data prediction approach reconstructs the subsequent video frame. (d) Our proposed data interpolation approach interpolates between two consecutive video frames, indexed by timestamp $t$, allowing a seamless temporal transition for visual content understanding, temporal modeling, and instance extraction for the object tracking task across various indications (e).
  • Figure 2: Temporal Interpolation in
  • Figure : Inplace Reconstruction Finetuning
  • Figure : Correspondence Extraction
  • Figure B.3: The conditional LDM uses U-Net [ronneberger2015u] blocks. First, a clean image $\mathbf{I}_k$ is converted to a noisy latent $\mathbf{z}_k$ via the noise sampling process $\mathcal{Q}(\cdot)$ (top branch). Then, well-structured regions are reconstructed from that extremely noisy input via the denoising/reconstruction process $\mathcal{P}_{\varepsilon_\theta}(\cdot)$ (bottom branch). Additionally, conditions can be added as indicators of the regions of interest. While the figure style is adapted from LDMs [rombach2022high], we made a distinct change reflecting the injected sampling process, following Prompt-to-Prompt [hertz2022prompt].
  • ...and 3 more figures
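The noise sampling step named in the Figure B.3 caption can be illustrated with the standard closed-form forward process of denoising diffusion models, $\mathbf{z}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$. The sketch below is a minimal version under assumptions: the caption does not define $\mathcal{Q}(\cdot)$ in this excerpt, so the linear $\beta$ schedule, the flattened-list latent, and the names `alpha_bar` and `noise_sample` are illustrative, and the LDM encoder that maps an image to its latent is omitted.

```python
import math
import random
from typing import List


def alpha_bar(betas: List[float], t: int) -> float:
    """Cumulative product of (1 - beta_s) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noise_sample(z0: List[float], betas: List[float], t: int,
                 rng: random.Random) -> List[float]:
    """Draw z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, with a = alpha_bar(betas, t)."""
    a = alpha_bar(betas, t)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for z in z0]


# Assumption: a 1000-step linear schedule from 1e-4 to 0.02.
num_steps = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (num_steps - 1) for i in range(num_steps)]

rng = random.Random(0)
z0 = [0.3, -1.2, 0.8]                             # toy latent vector
z_mid = noise_sample(z0, betas, 500, rng)         # partially noised latent
z_end = noise_sample(z0, betas, num_steps, rng)   # almost pure Gaussian noise
```

By the last step the signal coefficient $\sqrt{\bar{\alpha}_t}$ is close to zero, which is why the caption describes the input to the denoising branch as extremely noisy.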