
Adaptive Prototype Model for Attribute-based Multi-label Few-shot Action Recognition

Juefeng Xiao, Tianqi Xiang, Zhigang Tu

TL;DR

The paper tackles attribute bias in multi-attribute few-shot action recognition by introducing the Adaptive Attribute Prototype Model (AAPM). It uses a Text-Constrain Module (TCM) to ground visual prototypes in textual attribute semantics and an Attribute Assignment Method (AAM) to improve robustness against training bias, while keeping the visual and text encoders frozen to preserve generalization. AAPM achieves state-of-the-art results on both attribute-based multi-label few-shot action recognition (AMFAR) and single-label few-shot action recognition, validated on the newly proposed Multi-Kinetics dataset. The work provides a practical framework for using textual guidance to construct attribute-aware prototypes that generalize across diverse attributes and tasks.

Abstract

In real-world action recognition systems, incorporating more attributes helps achieve a more comprehensive understanding of human behavior. However, using a single model to recognize multiple attributes simultaneously can decrease accuracy. In this work, we propose a novel method, the Adaptive Attribute Prototype Model (AAPM), for human action recognition, which captures rich action-relevant attribute information and strikes a balance between accuracy and robustness. First, we introduce the Text-Constrain Module (TCM) to incorporate textual information from potential labels and to constrain the construction of prototype representations for different attributes. In addition, we explore the Attribute Assignment Method (AAM) to address training bias and increase robustness during training. Furthermore, we construct a new attribute-based multi-label video dataset, Multi-Kinetics, for evaluation; it contains various attribute labels (e.g., action, scene, object) related to human behavior. Extensive experiments demonstrate that AAPM achieves state-of-the-art performance in both attribute-based multi-label few-shot action recognition and single-label few-shot action recognition. The project and dataset are available from an anonymous account at https://github.com/theAAPM/AAPM

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The performance of different methods in constructing visual prototypes for action recognition as the number of attributes increases. The results show that the performance of existing models decreases as attributes are added, while our AAPM overcomes the attribute bias and maintains good performance.
  • Figure 2: (a) illustrates the recognition process of existing methods. During optimization, the training data introduces a specific attribute bias into the feature encoder, causing the model to fail to generalize to more attributes. (b) illustrates the recognition process of AAPM. AAPM freezes the visual and text encoders to ensure that the model extracts robust raw features, and then applies TCM to impose constraints on the target features. (c) illustrates the operating principle of TCM. The robust, unbiased features contain rich information, and TCM maps them to multiple attributes.
  • Figure 3: The query data shares the scene attribute with prototype 1, and the human-group and action attributes with prototype 2. Prototypical confusion occurs when prior prototype networks are applied to multi-attribute recognition.
  • Figure 4: The process of constructing visual prototypes from the support set. First, the support videos are fed into the fixed video encoder to obtain video features, while the text of the support labels is fed into the text encoder to obtain text features. Second, the text features of different attributes are individually fed into the adaptive attribute block to learn their respective attribute spaces. Finally, the attribute spaces constrain the visual features through the cross-attention block to construct the support prototypes of different attributes.
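
The prototype construction described in the Figure 4 caption can be roughed out as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the frozen encoders are replaced by hand-written toy feature vectors, the learned adaptive attribute block is omitted (the text feature is used directly as the attention query), and the cross-attention block is reduced to single-head scaled dot-product attention. The names `cross_attend` and `build_attribute_prototypes` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_query, frame_feats):
    """Scaled dot-product attention: one text query over frame features."""
    scale = math.sqrt(len(text_query))
    scores = [sum(q * f for q, f in zip(text_query, frame)) / scale
              for frame in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    # The attention-weighted sum of frame features serves as the
    # text-constrained prototype for this attribute.
    return [sum(w * frame[d] for w, frame in zip(weights, frame_feats))
            for d in range(dim)]


def build_attribute_prototypes(frame_feats, text_feats):
    """Return one prototype per attribute for a single support video."""
    return {attr: cross_attend(query, frame_feats)
            for attr, query in text_feats.items()}


# Toy stand-ins for frozen-encoder outputs (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0]]                         # two frame features
label_text = {"action": [4.0, 0.0], "scene": [0.0, 4.0]}  # one text feature per attribute
prototypes = build_attribute_prototypes(frames, label_text)
for attr, proto in prototypes.items():
    print(attr, [round(x, 3) for x in proto])
```

In this toy setup the same support video yields a different prototype for each attribute, because each attribute's text feature weights the frames differently. Comparing a query against prototypes one attribute at a time is one way to picture how the confusion shown in Figure 3 could be avoided.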