Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions
Rui Zhang, Shuailong Li, Junxiao Xue, Feng Lin, Qing Zhang, Xiao Ma, Xiaoran Yan
TL;DR
This work introduces hierarchical video recognition and proposes a contrastive video-language framework (H-CLIP) that models inter-level dependencies across a taxonomy of actions. It combines multi-modal learning with hierarchical modeling (S2I/I2S units) and a top-down filter to enforce taxonomy coherence, achieving state-of-the-art results on a new FMA-based medical dataset with four items and three scores per item. The approach yields substantial gains over flat baselines, especially in fine-grained C^2 predictions, and demonstrates strong zero-/few-shot transfer, underscoring the value of hierarchical structuring for real-world medical video understanding. The work also provides a curated dataset and analyses that highlight the potential social impact in healthcare and beyond, while outlining limitations and avenues for future research in scalable, robust hierarchical video-language systems.
Abstract
Video recognition remains an open challenge, requiring the identification of diverse content categories within videos. Mainstream approaches often perform flat classification, overlooking the intrinsic hierarchical structure relating categories. To address this, we formalize the novel task of hierarchical video recognition, and propose a video-language learning framework tailored for hierarchical recognition. Specifically, our framework encodes dependencies between hierarchical category levels, and applies a top-down constraint to filter recognition predictions. We further construct a new fine-grained dataset based on medical assessments for rehabilitation of stroke patients, serving as a challenging benchmark for hierarchical recognition. Through extensive experiments, we demonstrate the efficacy of our approach for hierarchical recognition, significantly outperforming conventional methods, especially for fine-grained subcategories. The proposed framework paves the way for hierarchical modeling in video understanding tasks, moving beyond flat categorization.
