Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

Pulkit Kumar, Namitha Padmanabhan, Luke Luo, Sai Saketh Rambhatla, Abhinav Shrivastava

TL;DR

This work tackles few-shot action recognition by decoupling motion and appearance. It introduces trajectory-aligned tokens (TATs), which fuse point trajectories from a point tracker with self-supervised DINOv2 patch tokens; the two are aligned via a grid sampler and fed to a Masked Space-time Transformer. Episode-based training uses Bi-MHM set matching with a cross-entropy loss, and the method achieves state-of-the-art results across multiple benchmarks with reduced data and training requirements. The approach emphasizes efficiency: only the transformer is trained, while point tracks come from an offline tracker and appearance features from a self-supervised model, which makes it practical for real-world few-shot action recognition.
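
The set-matching step can be illustrated with a small sketch. It assumes that Bi-MHM refers to the bidirectional mean Hausdorff metric used in earlier few-shot action recognition work, and that each video is represented as a set of embedding vectors. The function names (`euclidean`, `bi_mhm`, `classify`) and the choice of Euclidean distance are illustrative assumptions, not details taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bi_mhm(query, support):
    """Bidirectional mean Hausdorff distance between two sets of embeddings.

    Each direction averages, over one set, the distance from every element
    to its nearest neighbour in the other set; the two directions are summed.
    """
    q_to_s = sum(min(euclidean(q, s) for s in support) for q in query) / len(query)
    s_to_q = sum(min(euclidean(s, q) for q in query) for s in support) / len(support)
    return q_to_s + s_to_q


def classify(query, support_by_class):
    """Return the class whose support set is closest to the query set."""
    return min(support_by_class, key=lambda c: bi_mhm(query, support_by_class[c]))


# Toy 2-way 1-shot episode with 2-D embeddings (hypothetical values).
query = [[0.0, 0.0], [1.0, 0.0]]
support_by_class = {
    "push": [[0.0, 0.1], [1.0, 0.1]],
    "pull": [[3.0, 4.0], [4.0, 4.0]],
}
predicted = classify(query, support_by_class)
```

In an episodic training loop, the negated distances to each class would typically serve as logits for the cross-entropy loss; that wiring is omitted here.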

Abstract

We propose a simple yet effective approach for few-shot action recognition, emphasizing the disentanglement of motion and appearance representations. By harnessing recent progress in tracking, specifically point trajectories and self-supervised representation learning, we build trajectory-aligned tokens (TATs) that capture motion and appearance information. This approach significantly reduces the data requirements while retaining essential information. To process these representations, we use a Masked Space-time Transformer that effectively learns to aggregate information to facilitate few-shot action recognition. We demonstrate state-of-the-art results on few-shot action recognition across multiple datasets. Our project page is available at https://www.cs.umd.edu/~pulkit/tats

Paper Structure (37 sections, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Overview of our method. We take in video frames as input and extract point trajectories and DINO patch tokens using a Point Tracker and DINO respectively. These trajectories and tokens are then aligned using a grid sampler to form trajectory-aligned tokens (TATs). Finally, we pass the TATs through a masked space-time transformer and use a matching metric on the output embedding to predict the query action.
  • Figure 2: Effect of number of input frames under the 5-way 1-shot setting on SSv2-Full.
  • Figure 3: Quantitative analysis of 5-way 1-shot setting compared to MoLo. Top: Kinetics dataset; Middle: SSv2 Full dataset; Bottom: SSv2 Small dataset. "S" is a shorthand for "Something".
  • Figure 4: Qualitative results. We showcase examples where our method outperforms the baselines, which we attribute to the motion information that is clearly discernible in these samples. The examples are drawn from the Something-Something V2 (SSv2) dataset. The six columns show frames of each video sampled across time, and the lines denote the trajectories of tracked points. For visualization purposes, only the points on the most salient object are shown; background points are omitted.
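
The alignment step described in the Figure 1 caption, where a grid sampler reads patch tokens at tracked point locations, can be sketched as bilinear sampling. This is a minimal pure-Python illustration, not the authors' implementation: it assumes each frame's patch tokens form an H x W grid of C-dimensional vectors and that trajectory coordinates are already expressed in patch-grid units. The names `bilinear_sample` and `trajectory_aligned_tokens` are hypothetical, and point visibility and masking are not modeled.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate an H x W x C nested-list feature map at (x, y).

    Coordinates are in patch-grid units and are clamped to the grid, so
    points that drift off-frame still yield a defined feature vector.
    """
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    channels = len(feature_map[0][0])
    return [
        (1 - wy) * ((1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c])
        + wy * ((1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c])
        for c in range(channels)
    ]


def trajectory_aligned_tokens(frame_features, trajectories):
    """Sample one token per (trajectory, frame) pair.

    frame_features: list of T feature maps, one per frame.
    trajectories:   list of N tracks, each a list of T (x, y) points.
    Returns tokens[n][t], the feature vector for track n at frame t.
    """
    return [
        [bilinear_sample(frame_features[t], x, y) for t, (x, y) in enumerate(track)]
        for track in trajectories
    ]


# Two frames of 2 x 2 patch tokens with one channel (hypothetical values).
frames = [
    [[[0.0], [1.0]], [[2.0], [3.0]]],
    [[[10.0], [11.0]], [[12.0], [13.0]]],
]
# One point that moves from the top-left to the bottom-right patch.
tracks = [[(0.0, 0.0), (1.0, 1.0)]]
tokens = trajectory_aligned_tokens(frames, tracks)
```

Each trajectory then contributes one token per frame, so the token count scales with the number of tracked points rather than with the full patch grid.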