Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

Zihang Lai, Andrea Vedaldi

TL;DR

The paper addresses temporal inconsistency in video prediction by introducing Tracktention, a motion-aware transformer layer that uses pre-extracted point tracks to explicitly align temporal features. It comprises Attentional Sampling, Track Transformer, and Attentional Splatting, enabling image-based models to be upgraded into video-capable architectures with minimal modification, while preserving pre-trained weights via zero-initialized projections. Through experiments on video depth prediction and automatic colorization, Tracktention achieves state-of-the-art or competitive performance, delivering superior temporal coherence with modest parameter and computational overhead by leveraging existing trackers such as CoTracker3. The approach specializes in learning long-range, motion-informed correspondences, offering practical benefits for real-world video tasks without relying on heavy spatio-temporal attention, thereby improving both accuracy and efficiency in video analysis.
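
The role of the zero-initialized projections can be illustrated with a small NumPy sketch. It assumes the new layer's output is added to the image tokens as a residual through a final projection matrix; the residual form, the function name, and all sizes are assumptions made for illustration, not details taken from the paper. Under that assumption, a projection initialized to zero makes the augmented model reproduce the pre-trained image model exactly at the start of fine-tuning.

```python
import numpy as np

def add_projected_update(tokens, update, w_out):
    """Residual update: tokens + update @ w_out (assumed form, for illustration)."""
    return tokens + update @ w_out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 image tokens with 8 channels (hypothetical sizes)
update = rng.normal(size=(4, 8))   # stand-in for the output of the newly inserted layer
w_out = np.zeros((8, 8))           # zero-initialized output projection

out = add_projected_update(tokens, update, w_out)
# With w_out == 0 the new layer contributes nothing, so the tokens pass through unchanged.
unchanged = np.array_equal(out, tokens)
```

Once training moves `w_out` away from zero, the layer starts to contribute, so the pre-trained behavior serves as the starting point rather than being overwritten.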

Abstract

Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.

Paper Structure

This paper contains 51 sections, 7 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Left: The Tracktention Layer is a plug-and-play module that can convert an image-based network (e.g., for monocular depth prediction) into a state-of-the-art video network (e.g., for video depth prediction). It does so by integrating the output of any modern off-the-shelf point tracker via track cross-attention. Right: For example, Tracktention achieves state-of-the-art and efficient video depth prediction by transforming Depth Anything into a video depth model. See \ref{tab:depth_robustmvd} for detailed results. ${}^\ast$Single-image models.
  • Figure 2: Overview of Tracktention. We begin by using an off-the-shelf point tracker to extract a number of video tracks. Given these, we first sample image tokens at the track locations, obtaining corresponding track tokens (\ref{sec:attentional-sampling}). Next, we use a Track Transformer to update these tokens, propagating information temporally at consistent spatial locations (\ref{sec:track-transformer}). Finally, we splat the information back to the image tokens (\ref{sec:attentional-splatting}). By explicitly incorporating motion information through point tracks, Tracktention improves temporal alignment, effectively captures complex object movements, and ensures stable feature representations over time.
  • Figure 3: Left: the Tracktention architecture comprises Attentional Sampling, which pools information from images to tracks; the Track Transformer, which processes this information temporally; and Attentional Splatting, which moves the processed information back to the images. Right: Tracktention is easily integrated into ViTs and ConvNets to make video networks out of image ones.
  • Figure 3: Quantitative comparison of video colorization methods on the DAVIS and Videvo datasets. When added to four different baseline models, our method consistently improves the Color Distribution Consistency (CDC) metric on both datasets.
  • Figure 4: Video depth prediction, comparing Tracktention (+DepthAnything), DepthCrafter [hu2024depthcrafter], and DUSt3R [wang24dust3r]. We visualize a column of pixels (highlighted in red) over time to illustrate temporal variation. Our model shows stable, coherent depth estimation over time, while DepthCrafter exhibits significant errors in certain regions (blue box). DUSt3R struggles with dynamic content (green box).
  • ...and 10 more figures
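
The sample, transform, splat flow described in the captions of Figures 2 and 3 can be sketched in NumPy as follows. This is a heavily simplified illustration, not the authors' implementation: attention weights are derived only from the spatial distance between track points and token centers, the Track Transformer is replaced by a single unparameterized self-attention step along each track, and the tracks are random rather than produced by a tracker such as CoTracker3. The function names, the `sigma` parameter, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_logits(query_xy, key_xy, sigma=1.0):
    """Logits that favor spatially close pairs: (Q, 2), (K, 2) -> (Q, K)."""
    d2 = ((query_xy[:, None, :] - key_xy[None, :, :]) ** 2).sum(-1)
    return -d2 / (2.0 * sigma ** 2)

def tracktention_sketch(feats, grid_xy, tracks, sigma=1.0):
    """feats: (T, N, C) image tokens, grid_xy: (N, 2) token centers,
    tracks: (T, M, 2) track point locations per frame. Returns (T, N, C)."""
    T, N, C = feats.shape
    # 1) Sampling: each track point pools features from nearby image tokens.
    track_tokens = np.stack([
        softmax(distance_logits(tracks[t], grid_xy, sigma)) @ feats[t]
        for t in range(T)
    ])                                                # (T, M, C)
    # 2) Temporal mixing: self-attention over time, separately for each track
    #    (a stand-in for the Track Transformer, with no learned weights).
    per_track = track_tokens.transpose(1, 0, 2)       # (M, T, C)
    scores = per_track @ per_track.transpose(0, 2, 1) / np.sqrt(C)
    per_track = softmax(scores) @ per_track           # (M, T, C)
    track_tokens = per_track.transpose(1, 0, 2)       # (T, M, C)
    # 3) Splatting: each image token gathers from nearby track tokens.
    return np.stack([
        softmax(distance_logits(grid_xy, tracks[t], sigma)) @ track_tokens[t]
        for t in range(T)
    ])                                                # (T, N, C)

rng = np.random.default_rng(0)
T, H, W, C, M = 3, 4, 4, 8, 5                         # hypothetical sizes
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
grid_xy = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
feats = rng.normal(size=(T, H * W, C))
tracks = rng.uniform(0, W - 1, size=(T, M, 2))        # random stand-in for tracker output
update = tracktention_sketch(feats, grid_xy, tracks)
```

Because the temporal step runs along tracks rather than at fixed pixel locations, information is exchanged between frames at positions that follow the motion, which is the property the captions attribute to the layer.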