ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Amir Aghdam, Vincent Tao Hu, Björn Ommer

TL;DR

ActAlign tackles zero-shot fine-grained video classification by leveraging a large language model to generate an ordered sequence of sub-actions for each candidate class and then aligning these textual sequences to frame-wise video embeddings using Dynamic Time Warping in a shared SigLIP embedding space; the approach operates without any video–text training data. The method formalizes the prediction as $\hat{y}_i = \arg\max_{c_m \in \mathcal{Y}} \hat{\gamma}_{i,m}$, where $\hat{\gamma}_{i,m}$ reflects temporal alignment quality between frames and sub-actions after smoothing and affinity transformation. On ActionAtlas, ActAlign achieves Top-1 $30.40\%$, Top-2 $53.01\%$, and Top-3 $70.27\%$, surpassing baselines including billion-parameter VLMs while using ~8× fewer parameters, thanks to context-enhanced sub-actions and DTW-based temporal structure. This work demonstrates that structured language priors, combined with classical sequence alignment, can unlock open-set, fine-grained video understanding in a training-free, domain-general framework.

Abstract

We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas--the most diverse benchmark of fine-grained actions across multiple sports--where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
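The alignment-and-argmax rule described above can be illustrated with a minimal sketch. It assumes a precomputed frame-by-sub-action similarity matrix per class and uses a generic DTW that maximizes accumulated similarity, then divides by the path length. The step set, the normalization, and the names `dtw_alignment_score`, `predict`, `class_a`, and `class_b` are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # sim[t][k]: frame t vs. sub-action k


def dtw_alignment_score(sim: Matrix) -> Tuple[float, List[Tuple[int, int]]]:
    """Best monotonic alignment from (0, 0) to (T-1, K-1).

    Returns the accumulated similarity divided by the path length, plus the
    path itself. Allowed steps are the classic DTW moves: next frame, next
    sub-action, or both (an assumption of this sketch).
    """
    T, K = len(sim), len(sim[0])
    acc = [[float("-inf")] * K for _ in range(T)]
    back = [[None] * K for _ in range(T)]
    acc[0][0] = sim[0][0]
    for t in range(T):
        for k in range(K):
            if t == 0 and k == 0:
                continue
            candidates = []
            if t > 0:
                candidates.append((acc[t - 1][k], (t - 1, k)))
            if k > 0:
                candidates.append((acc[t][k - 1], (t, k - 1)))
            if t > 0 and k > 0:
                candidates.append((acc[t - 1][k - 1], (t - 1, k - 1)))
            best, prev = max(candidates, key=lambda c: c[0])
            acc[t][k] = best + sim[t][k]
            back[t][k] = prev
    # Backtrack from the final cell to recover the alignment path.
    path = [(T - 1, K - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return acc[T - 1][K - 1] / len(path), path


def predict(sims_by_class: Dict[str, Matrix]) -> Tuple[str, Dict[str, float]]:
    """Pick the class whose sub-action script aligns best (the arg max)."""
    scores = {name: dtw_alignment_score(s)[0] for name, s in sims_by_class.items()}
    return max(scores, key=scores.get), scores


# Toy data: four frames, two sub-actions per class (hypothetical values).
sim_a = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]  # frames follow the script order
sim_b = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]]  # frames contradict the order
label, scores = predict({"class_a": sim_a, "class_b": sim_b})
print(label, scores)
```

In this toy case `class_a` is selected, because its frames match the sub-actions in script order, while the same frame-level similarities in the wrong order (`class_b`) are penalized by the monotonicity constraint.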

Paper Structure

This paper contains 64 sections, 15 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: ActAlign improves zero-shot fine-grained action recognition by modeling actions as structured language sequences. By aligning sub-action descriptions with video frames (green vs. red paths), we achieve more accurate predictions without requiring any video-text training data.
  • Figure 2: Our ActAlign Method Overview. (1) Sub-action Generation: Given fine-grained actions (e.g. Basketball Tactics), we prompt an LLM to decompose each action (e.g. Hookshot, JumpShot, Dunk) into a temporal sequence of sub-actions. (2) Temporal Alignment: Video frames are encoded by a frozen pretrained vision encoder and smoothed via a moving-average filter. Simultaneously, each sub-action is encoded by the text encoder. We compute a cosine-similarity matrix between frame and sub-action embeddings, then apply Dynamic Time Warping (DTW) to find the optimal alignment path and normalized alignment score. (3) Class Prediction: We repeat this process for each candidate action $m$, compare normalized alignment scores $\hat{\gamma}_{\text{video}, m}$, and select the action sequence with the highest score as the final prediction.
  • Figure 3: t-SNE visualization of sub-action embeddings. Each color corresponds to a sport domain. Augmenting sub-actions with context yields more discriminative clusters and improves textual grounding.
  • Figure 4: DTW alignment paths for an incorrect prediction (left) versus a correct classification (right). The correct class exhibits clearer segmentation and higher alignment quality. The sub-action scripts are provided in Table \ref{tab:example_scripts}.
  • Figure 5: Signal smoothing reduces high-frequency noise and sharpens transitions between sub-actions. Similarity matrices are shown before and after applying a moving-average filter ($w=30$).
  • ...and 4 more figures
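
The smoothing and similarity steps named in the Figure 2 and Figure 5 captions can be sketched as follows. This is a minimal illustration on toy 2-D vectors, assuming a centered moving-average window that is truncated at the clip boundaries; the boundary handling and the helper names `moving_average`, `cosine`, and `similarity_matrix` are assumptions, and real embeddings would come from the frozen vision and text encoders.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def moving_average(frames: Sequence[Vector], w: int) -> List[List[float]]:
    """Smooth frame embeddings over time with a window of w frames.

    The window is roughly centered on each frame and shrinks at the start
    and end of the clip (an assumption of this sketch).
    """
    T = len(frames)
    out = []
    for t in range(T):
        lo = max(0, t - w // 2)
        hi = min(T, t - w // 2 + w)
        window = frames[lo:hi]
        # Average each embedding dimension across the frames in the window.
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_matrix(frames: Sequence[Vector], subactions: Sequence[Vector]) -> List[List[float]]:
    """Rows are frames, columns are sub-actions."""
    return [[cosine(f, s) for s in subactions] for f in frames]


# Toy clip: frame 1 is a noisy outlier among frames that match sub-action 0.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
subactions = [[1.0, 0.0], [0.0, 1.0]]

raw = similarity_matrix(frames, subactions)
smoothed = similarity_matrix(moving_average(frames, w=3), subactions)
print(raw[1], smoothed[1])
```

After smoothing, the outlier frame is again closer to sub-action 0 than to sub-action 1, which is the kind of noise reduction the Figure 5 caption describes. The window size here (`w=3`) is chosen for the toy clip; the caption reports $w=30$.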