ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam, Vincent Tao Hu, Björn Ommer
TL;DR
ActAlign tackles zero-shot fine-grained video classification in two steps. A large language model first generates an ordered sequence of sub-actions for each candidate class; these textual sequences are then aligned to frame-wise video embeddings with Dynamic Time Warping (DTW) in a shared SigLIP embedding space. The approach uses no video–text training data. The prediction is formalized as $\hat{y}_i = \arg\max_{c_m \in \mathcal{Y}} \hat{\gamma}_{i,m}$, where $\hat{\gamma}_{i,m}$ measures the temporal alignment quality between the frames of video $i$ and the sub-actions of class $c_m$, computed after smoothing and an affinity transformation. On ActionAtlas, ActAlign achieves Top-1 $30.40\%$, Top-2 $53.01\%$, and Top-3 $70.27\%$ accuracy, surpassing baselines that include billion-parameter VLMs while using ~8× fewer parameters; the gains are attributed to context-enhanced sub-actions and DTW-based temporal structure. The work demonstrates that structured language priors, combined with classical sequence alignment, can unlock open-set, fine-grained video understanding in a training-free, domain-general framework.
Abstract
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8× fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
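The alignment idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`l2_normalize`, `dtw_alignment_score`, `classify`) are hypothetical, the embeddings are toy vectors standing in for SigLIP frame and text features, the score is normalized by path length as an assumed convention, and the smoothing and affinity transformation mentioned in the TL;DR are omitted. It only shows how a monotonic DTW alignment over a frame-by-sub-action similarity matrix can separate two classes that share the same sub-actions in a different order.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def dtw_alignment_score(frames, subactions):
    """Mean cosine similarity along the best monotonic alignment path.

    frames:     (T, D) array, one embedding per video frame.
    subactions: (K, D) array, one embedding per ordered sub-action.
    Path-length normalization is an assumption of this sketch.
    """
    sim = l2_normalize(np.asarray(frames, dtype=float)) @ \
        l2_normalize(np.asarray(subactions, dtype=float)).T  # (T, K)
    T, K = sim.shape
    total = np.zeros((T, K))              # best cumulative similarity
    length = np.zeros((T, K), dtype=int)  # cells on that best path
    for t in range(T):
        for k in range(K):
            preds = []
            if t > 0:
                preds.append((total[t - 1, k], length[t - 1, k]))
            if k > 0:
                preds.append((total[t, k - 1], length[t, k - 1]))
            if t > 0 and k > 0:
                preds.append((total[t - 1, k - 1], length[t - 1, k - 1]))
            if preds:
                # Prefer higher similarity; break ties with the shorter path.
                best_total, best_len = max(preds, key=lambda p: (p[0], -p[1]))
            else:
                best_total, best_len = 0.0, 0  # start cell (0, 0)
            total[t, k] = best_total + sim[t, k]
            length[t, k] = best_len + 1
    return total[-1, -1] / length[-1, -1]


def classify(frames, class_subactions):
    """Return the class whose sub-action sequence aligns best, plus all scores."""
    scores = {name: dtw_alignment_score(frames, sub)
              for name, sub in class_subactions.items()}
    return max(scores, key=scores.get), scores


# Toy example: orthogonal vectors play the role of three distinct sub-actions.
basis = np.eye(4)
video = basis[[0, 0, 1, 1, 2, 2]]  # frames show sub-actions 0, 1, 2 in order
candidates = {
    "forward": basis[[0, 1, 2]],   # same order as the video
    "reversed": basis[[2, 1, 0]],  # same sub-actions, opposite order
}
predicted, scores = classify(video, candidates)
print(predicted, scores)
```

In this toy case, mean-pooling the frame embeddings would score both candidates identically, since they contain the same sub-actions; the ordering constraint in the alignment path is what distinguishes them.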
