Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions

Rui Zhang, Shuailong Li, Junxiao Xue, Feng Lin, Qing Zhang, Xiao Ma, Xiaoran Yan

TL;DR

This work introduces hierarchical video recognition and proposes a contrastive video-language framework (H-CLIP) that models inter-level dependencies across a taxonomy of actions. It combines multi-modal learning with hierarchical modeling (S2I/I2S units) and a top-down filter to enforce taxonomy coherence, achieving state-of-the-art results on a new FMA-based medical dataset with four items and three scores per item. The approach yields substantial gains over flat baselines, especially in fine-grained $C^2$ predictions, and demonstrates strong zero-/few-shot transfer, underscoring the value of hierarchical structuring for real-world medical video understanding. The work also provides a curated dataset and analyses that highlight the potential social impact in healthcare and beyond, while outlining limitations and avenues for future research in scalable, robust hierarchical video-language systems.

Abstract

Video recognition remains an open challenge, requiring the identification of diverse content categories within videos. Mainstream approaches often perform flat classification, overlooking the intrinsic hierarchical structure relating categories. To address this, we formalize the novel task of hierarchical video recognition, and propose a video-language learning framework tailored for hierarchical recognition. Specifically, our framework encodes dependencies between hierarchical category levels, and applies a top-down constraint to filter recognition predictions. We further construct a new fine-grained dataset based on medical assessments for rehabilitation of stroke patients, serving as a challenging benchmark for hierarchical recognition. Through extensive experiments, we demonstrate the efficacy of our approach for hierarchical recognition, significantly outperforming conventional methods, especially for fine-grained subcategories. The proposed framework paves the way for hierarchical modeling in video understanding tasks, moving beyond flat categorization.
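
The top-down constraint mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the taxonomy layout, the helper names (`children_of`, `top_down_predict`), and the scores are all hypothetical, chosen only to mirror the "four items, three scores per item" shape from the TL;DR. The idea shown is simply that the fine-grained prediction is restricted to the children of the predicted coarse category.

```python
from typing import List, Sequence, Tuple

# Hypothetical taxonomy: 4 coarse items, each with 3 fine-grained scores.
# Fine class k belongs to coarse class k // SCORES_PER_ITEM (an assumption).
NUM_ITEMS = 4
SCORES_PER_ITEM = 3


def children_of(coarse: int) -> List[int]:
    """Return the fine-class indices that belong to one coarse class."""
    start = coarse * SCORES_PER_ITEM
    return list(range(start, start + SCORES_PER_ITEM))


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top_down_predict(
    coarse_scores: Sequence[float], fine_scores: Sequence[float]
) -> Tuple[int, int]:
    """Pick the coarse class first, then the best fine class among its children."""
    coarse = argmax(coarse_scores)
    allowed = children_of(coarse)
    fine = max(allowed, key=lambda i: fine_scores[i])
    return coarse, fine


# Made-up similarity scores for one video (illustrative only).
coarse_scores = [0.1, 0.7, 0.1, 0.1]
fine_scores = [0.05, 0.05, 0.30, 0.10, 0.20, 0.05,
               0.05, 0.05, 0.05, 0.04, 0.03, 0.03]

flat_fine = argmax(fine_scores)  # unconstrained choice over all fine classes
coarse, fine = top_down_predict(coarse_scores, fine_scores)

print("flat fine prediction:", flat_fine)
print("top-down prediction:", (coarse, fine))
```

In this toy case the unconstrained fine-level choice falls under a different coarse item than the coarse-level prediction, while the filtered choice is consistent with the taxonomy by construction.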

Paper Structure

This paper contains 38 sections, 17 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: An illustrated example of a stroke patient undergoing motor recovery assessment in a hierarchical video recognition problem.
  • Figure 2: An overview of our proposed framework. The inputs are videos, annotations, and descriptions of hierarchical categories; the outputs are the recognition results at each category level. Two modules produce them: multi-modal learning, which facilitates knowledge learning at each level, and hierarchical modeling, which enables interaction among the knowledge at different levels.
  • Figure 3: An illustration of S2I unit (left) and I2S unit (right).
  • Figure 4: Zero-/few-shot results on the FMA dataset with 8 video frames and ViT-B/32 as the backbone. The two baselines can only perform video recognition at the $C^2$ level, as they do not support hierarchical recognition.
  • Figure 5: Recognition results for different backbones with 8 video frames as the visual input, and for different numbers of frames with ViT-B/32 as the backbone.
  • ...and 2 more figures