
Autogenic Language Embedding for Coherent Point Tracking

Zikai Song, Ying Tang, Run Luo, Lintao Ma, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang

TL;DR

This work tackles long-range point tracking by introducing autogenic language embedding to enforce semantic consistency across frames. The proposed ALTracker learns text embeddings from visual features via a lightweight mapping network and employs a consistency decoder to integrate language cues into visual representations with minimal overhead. Automatic text token generation and a CLIP-inspired text encoder enable language-enhanced tracking without manual annotations. Experiments on PointOdyssey and TAP-Vid datasets show state-of-the-art improvements in tracking accuracy and robustness, demonstrating the practical value of language-informed visual consistency for long video sequences.

Abstract

Point tracking is a challenging task in computer vision, aiming to establish point-wise correspondence across long video sequences. Recent advancements have primarily focused on temporal modeling techniques to improve local feature similarity, often overlooking the valuable semantic consistency inherent in tracked points. In this paper, we introduce a novel approach leveraging language embeddings to enhance the coherence of frame-wise visual features related to the same object. Our proposed method, termed autogenic language embedding for visual feature enhancement, strengthens point correspondence in long-term sequences. Unlike existing visual-language schemes, our approach learns text embeddings from visual features through a dedicated mapping network, enabling seamless adaptation to various tracking tasks without explicit text annotations. Additionally, we introduce a consistency decoder that efficiently integrates text tokens into visual features with minimal computational overhead. Through enhanced visual consistency, our approach significantly improves tracking trajectories in lengthy videos with substantial appearance variations. Extensive experiments on widely-used tracking benchmarks demonstrate the superior performance of our method, showcasing notable enhancements compared to trackers relying solely on visual cues.
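The two components named in the abstract, a mapping network that derives text embeddings from visual features and a decoder that folds text tokens back into those features, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the mean-pooling step, the single-head cross-attention, the residual connection, and all dimensions are hypothetical, and the CLIP-inspired text encoder is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def map_to_text_tokens(visual_feats, w_map, num_tokens):
    """Hypothetical mapping network: pool the visual features, then project
    the pooled vector to `num_tokens` pseudo text-token embeddings."""
    pooled = visual_feats.mean(axis=0)        # (C,)
    tokens = pooled @ w_map                   # (num_tokens * D,)
    return tokens.reshape(num_tokens, -1)     # (num_tokens, D)

def integrate_text(visual_feats, text_tokens, w_q, w_k, w_v):
    """Hypothetical integration step: visual features attend to the text
    tokens (single-head cross-attention); the result is added residually."""
    q = visual_feats @ w_q                    # (N, D)
    k = text_tokens @ w_k                     # (T, D)
    v = text_tokens @ w_v                     # (T, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, T), rows sum to 1
    return visual_feats + attn @ v            # (N, C)

# Illustrative sizes only: N positions, C visual dim, D text dim, T tokens.
rng = np.random.default_rng(0)
N, C, D, T = 16, 32, 24, 4
feats = rng.standard_normal((N, C))
w_map = rng.standard_normal((C, T * D)) * 0.1
w_q = rng.standard_normal((C, D)) * 0.1
w_k = rng.standard_normal((D, D)) * 0.1
w_v = rng.standard_normal((D, C)) * 0.1

tokens = map_to_text_tokens(feats, w_map, T)
enhanced = integrate_text(feats, tokens, w_q, w_k, w_v)
print(tokens.shape, enhanced.shape)  # (4, 24) (16, 32)
```

Because the text tokens are computed from the visual features themselves, this sketch needs no caption or prompt as input, which mirrors the annotation-free property the abstract claims. The enhanced features keep the input shape, so in principle they could be passed to a downstream point tracker unchanged.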

Paper Structure (17 sections, 3 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Visualizing trajectories of tracked points. We visualize the object motion and compare the tracking trajectory between the baseline method relying solely on visual features (without autogenic language embedding) and our approach (with autogenic language embedding). Our method maintains the same structural framework as the baseline, differing only in the utilization of language-assisted consistency.
  • Figure 2: Visualization of semantic correspondence with various text prompts. The leftmost image is the source image with a set of key points; the target images on the right show correspondence results under various text prompts. Circles denote correctly predicted points under the threshold $\alpha_{bbox}\leq0.1$, and crosses denote incorrect matches.
  • Figure 3: The architecture of our ALTracker. We introduce a mapping network that aligns image features with corresponding mapped tokens to obtain the text information automatically. A consistency decoder jointly processes textual and visual information: its text enhancement module refines the text embeddings to make them more descriptive, and its image-text integration module merges the enhanced text embeddings into the image features. Finally, the tracking result is obtained with any point tracker.
  • Figure 4: The architecture of the consistency decoder. The text enhancement module enriches the text embeddings by integrating image embeddings into the attention mechanism. The text-image integration module combines the enhanced text embeddings with the image features to obtain the consistency feature.
  • Figure 5: Visualization of point trajectories on DAVIS. We compare our ALTracker with autogenic language embedding (w) against the baseline tracker without language information (w/o). The images show tracking results over time, and different colors indicate different points. Circles indicate correctly predicted points under the threshold $\alpha_{bbox}\leq0.1$, and crosses indicate incorrect matches. Notably, our method yields accurate, coherent long-range motion even under fast motion (Bmx-trees), object deformation (Motocross), scale change (Soapbox), and similar distractors (Lab-coat).
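
The captions for Figures 2 and 5 mark a point as correct under the threshold $\alpha_{bbox}\leq0.1$ without spelling out the test. The sketch below assumes the common PCK-style convention from semantic correspondence, in which a prediction counts as correct when it lies within $\alpha$ times the longer side of the object's bounding box from the ground truth. That normalization is an assumption, not something the captions state, and the function names are hypothetical.

```python
import math

def is_correct(pred, gt, bbox_size, alpha=0.1):
    """Return True if `pred` lies within alpha * max(bbox height, width)
    of `gt`. The bounding-box normalization is an assumed convention."""
    height, width = bbox_size
    return math.dist(pred, gt) <= alpha * max(height, width)

def pck(preds, gts, bbox_size, alpha=0.1):
    """Fraction of predicted points that pass the threshold test."""
    flags = [is_correct(p, g, bbox_size, alpha) for p, g in zip(preds, gts)]
    return sum(flags) / len(flags)

# Illustrative values: a 100 x 200 box gives a 20-pixel tolerance at alpha = 0.1.
bbox = (100, 200)
preds = [(10, 10), (50, 50), (120, 80)]
gts = [(12, 14), (90, 50), (120, 95)]
print([is_correct(p, g, bbox) for p, g in zip(preds, gts)])  # [True, False, True]
```

In the figures' terms, the first and third points would be drawn as circles and the second as a cross. `pck(preds, gts, bbox)` returns 2/3 for this toy input.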