ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition

Jiaming Zhou, Junwei Liang, Kun-Yu Lin, Jinrui Yang, Wei-Shi Zheng

TL;DR

This work tackles zero-shot action recognition by addressing the semantic misalignment between videos and concise class descriptions. It introduces ActionHub, a large-scale dataset of 3.6 million action video descriptions covering 1,211 actions, and the CoCo framework, which fuses action definitions with rich video descriptions and enforces cross-action invariance through a cycle-consistency mechanism. The framework has two key components: a dual cross-modality alignment module and a cross-action invariance mining module. It achieves state-of-the-art results on Kinetics-ZSAR, UCF101, and HMDB51 under the intra-dataset setting and competitive performance under the cross-dataset setting. Together, the dataset and method aim to improve transferability to unseen actions and support open-vocabulary action understanding.

Abstract

Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions. The text queries (class descriptions) used in existing ZSAR works, however, are often short action names that fail to capture the rich semantics in the videos, leading to misalignment. With the intuition that video content descriptions (e.g., video captions) can provide rich contextual information about the visual concepts in videos, we propose to utilize human-annotated video descriptions to enrich the semantics of the class descriptions of each action. However, all existing action video description datasets are limited in terms of the number of actions, the semantics of video descriptions, etc. To this end, we collect a large-scale action video description dataset named ActionHub, which covers a total of 1,211 common actions and provides 3.6 million action video descriptions. With the proposed ActionHub dataset, we further propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module. Specifically, the Dual Cross-modality Alignment module utilizes both action labels and video descriptions from ActionHub to obtain rich class semantic features for feature alignment. The Cross-action Invariance Mining module exploits a cycle-reconstruction process between the class semantic feature spaces of seen actions and unseen actions, aiming to guide the model to learn cross-action invariant representations. Extensive experimental results demonstrate that our CoCo framework significantly outperforms the state-of-the-art on three popular ZSAR benchmarks (i.e., Kinetics-ZSAR, UCF101, and HMDB51) under two different learning protocols in ZSAR. We will release our code, models, and the proposed ActionHub dataset.
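
To make the two ideas in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that label and description embeddings are already available as arrays, that they are fused by simple averaging, and that the cycle reconstruction is a soft-attention round trip from seen to unseen class features and back. The function names, the fusion rule, and the `temperature` parameter are all hypothetical; the abstract does not specify the actual formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_semantic_features(label_emb, desc_embs):
    """Fuse each class label embedding with the mean of its description embeddings.

    label_emb: (C, D) array, one embedding per action label.
    desc_embs: list of C arrays, each (N_c, D), the description embeddings of a class.
    ASSUMPTION: fusion is a plain average of the two normalized vectors.
    """
    desc_mean = np.stack([d.mean(axis=0) for d in desc_embs])
    fused = 0.5 * (l2_normalize(label_emb) + l2_normalize(desc_mean))
    return l2_normalize(fused)

def zero_shot_predict(video_feats, class_feats):
    """Assign each video to the class with the highest cosine similarity."""
    sims = l2_normalize(video_feats) @ class_feats.T  # (V, C)
    return sims.argmax(axis=1)

def _softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cycle_reconstruct(seen, unseen, temperature=0.1):
    """Round trip seen -> unseen -> seen using soft attention.

    ASSUMPTION: a generic attention-based reconstruction, used only for illustration.
    Returns an (S, D) array with one reconstructed feature per seen class.
    """
    to_unseen = _softmax(seen @ unseen.T / temperature) @ unseen      # (S, D)
    back_to_seen = _softmax(to_unseen @ seen.T / temperature) @ seen  # (S, D)
    return back_to_seen

def cycle_loss(seen, unseen, temperature=0.1):
    """Mean squared error between seen features and their cycle reconstruction."""
    rec = cycle_reconstruct(seen, unseen, temperature)
    return float(np.mean((seen - rec) ** 2))
```

In this sketch, a small `cycle_loss` means the seen-class features survive the round trip through the unseen-class feature space, which is one simple way to express the cross-action invariance the abstract describes.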

Paper Structure

This paper contains 25 sections, 13 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The top of the figure shows a common framework for zero-shot action recognition, which maps the videos and class descriptions of actions into the corresponding feature spaces, and then learns an alignment between these two feature spaces. The class descriptions used in existing works are generally action names, action attributes, and action definitions. The semantics of these class-level descriptions are limited when matched with the rich semantics of the visual concepts in videos (e.g., action performer, objects, scenes), leading to a cross-modality diversity gap between videos and texts. The bottom of the figure shows two instances of action video descriptions for the action "kick ball". These descriptions provide abundant textual information correlated with the visual concepts in the videos of the action, so the cross-modality diversity gap can be effectively alleviated.
  • Figure 2: Three actions, i.e., "golf", "picking fruit", and "tennis swing" from the proposed ActionHub dataset. For each action, we show three instances of action video descriptions, which provide rich contextual information about the visual concepts of videos of the action. Such rich textual semantics of actions in our proposed ActionHub dataset help ZSAR models better understand human actions.
  • Figure 3: The process of collecting the ActionHub dataset from the Internet. To collect a large-scale action video description dataset, we select a total of 1,490 actions from seven popular video action datasets (i.e., Kinetics-700 [carreira2019short], Moments-in-Time [monfort2019moments], AVA [gu2018ava], ActivityNet [caba2015activitynet], Olympic Sports [niebles2010modeling], HMDB51 [hmdb51], and UCF101 [ucf101]). After action deduplication, we obtain 1,211 action queries. For each action, we use the action name as the query to search for videos on websites. The descriptions of the videos in the returned video lists (provided by the websites) are kept as the ActionHub dataset.
  • Figure 4: Left: the number of actions with respect to different numbers of video descriptions. Right: the number of actions with respect to different numbers of sentences. In the collected ActionHub dataset, nearly half of the action classes have fewer than 1,000 video descriptions and 3,000 sentences. (Zoom in for a better view)
  • Figure 5: The 40 actions with the most descriptions and the 20 actions with the fewest descriptions. The actions with the most descriptions are more common in daily life, while the actions with the fewest descriptions are rarely seen in real scenarios.
  • ...and 8 more figures
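
The action deduplication step mentioned in the Figure 3 caption could be sketched as follows. This is only an illustration under stated assumptions: the captions do not say how duplicates were identified, so the sketch treats two action names as duplicates when they match after simple string normalization. The helper names `normalize_action` and `deduplicate_actions` and the example inputs are hypothetical.

```python
import re

def normalize_action(name: str) -> str:
    """Lowercase a name, turn underscores and hyphens into spaces, drop other punctuation."""
    name = name.lower()
    name = re.sub(r"[_\-]+", " ", name)
    name = re.sub(r"[^a-z0-9 ]+", "", name)
    return re.sub(r"\s+", " ", name).strip()

def deduplicate_actions(names):
    """Return normalized action names in first-seen order, without repeats.

    ASSUMPTION: duplicates are detected purely by normalized string equality;
    synonyms such as "golf" and "playing golf" are NOT merged by this sketch.
    """
    unique = {}
    for raw in names:
        key = normalize_action(raw)
        if key and key not in unique:
            unique[key] = raw  # remember the first raw spelling for reference
    return list(unique)

# Hypothetical action names pooled from several source datasets.
pooled = ["Kick Ball", "kick_ball", "kick-ball", "Golf", "golf "]
queries = deduplicate_actions(pooled)  # ["kick ball", "golf"]
```

Each resulting query string would then be used to search for videos, as the caption describes. Merging semantically equivalent names would need additional rules or manual review.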