ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Guoliang Xu, Jianqin Yin, Feng Zhou, Yonghao Dang

TL;DR

This work addresses the saturation of information in image-only group activity recognition by introducing ActivityCLIP, a plug-and-play framework that mines text semantics from action labels to supplement image cues. The approach combines an image branch with a text branch, where Image2Text transfers image information into text space under CLIP guidance via knowledge distillation, and a lightweight text-branch interaction module is injected into the image branch using low-rank adaptations $F(x)=W_0(x)+\alpha BA(x)$. Through training-time KD and parameter-efficient cross-modal coupling, ActivityCLIP improves performance across multiple GAR baselines on Volleyball and Collective Activity datasets, with ablations validating the contribution of Image2Text, the transformer-based interaction modeling, and the hyperparameters $r$ and $\alpha$. The results demonstrate that text-modality augmentation can provide robust, complementary cues for actor interactions, enhancing recognition accuracy in crowded scenes while maintaining efficiency. Practically, this plug-and-play method facilitates broad applicability to existing image-based GAR systems with minimal parameter overhead.
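The low-rank adaptation $F(x)=W_0(x)+\alpha BA(x)$ mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LowRankAdaptedLinear`, the helper `matvec`, the zero initialization of $B$, and treating $W_0$ as frozen are all assumptions made for illustration. It only shows how a small pair of matrices $A$ (rank $r$ by input size) and $B$ (output size by rank $r$) adds a scaled correction to an existing linear map.

```python
import random


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LowRankAdaptedLinear:
    """Hypothetical layer computing F(x) = W0 x + alpha * B (A x)."""

    def __init__(self, w0, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(w0), len(w0[0])
        self.w0 = w0        # existing weights, assumed frozen in this sketch
        self.alpha = alpha  # scaling factor for the low-rank update
        # A has shape (rank, d_in) with a small random init.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B has shape (d_out, rank) and starts at zero (an assumption here),
        # so the adapted layer initially matches the original one.
        self.b = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.w0, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.alpha * u for y, u in zip(base, update)]

    def num_trainable(self):
        """Count only the entries of A and B."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


w0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LowRankAdaptedLinear(w0, rank=1, alpha=0.5)
x = [1.0, 2.0, 3.0]
print(layer.forward(x))       # equals W0 x while B is still zero
print(layer.num_trainable())  # r * (d_in + d_out) entries
```

With rank $r$, the update adds $r(d_{in}+d_{out})$ trainable entries instead of $d_{in} d_{out}$, which is where the parameter saving comes from when $r$ is small relative to the layer size.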

Abstract

Previous methods usually extract information only from the image modality to recognize group activity. However, mining image information is approaching saturation, making it difficult to extract richer information. Therefore, extracting complementary information from other modalities to supplement image information has become increasingly important. In fact, action labels provide clear text information that expresses the action's semantics, which existing methods often overlook. Thus, we propose ActivityCLIP, a plug-and-play method that mines the text information contained in the action labels to supplement the image information and enhance group activity recognition. ActivityCLIP consists of text and image branches, where the text branch is plugged into the image branch (the off-the-shelf image-based method). The text branch includes Image2Text and relation modeling modules. Specifically, we propose the knowledge transfer module, Image2Text, which adapts image information into text information extracted by CLIP via knowledge distillation. Further, to keep our method convenient, we add only a few trainable parameters on top of the relation module of the image branch to model interaction relations in the text branch. To show our method's generality, we extend three representative methods with ActivityCLIP, which adds only limited trainable parameters and achieves favorable performance improvements for each method. We also conduct extensive ablation studies and compare our method with state-of-the-art methods to demonstrate the effectiveness of ActivityCLIP.
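The knowledge-distillation idea behind Image2Text can be sketched as a student that maps an image feature toward a fixed text feature supplied by a teacher. The abstract does not state the loss or the module structure, so everything below is an assumption for illustration: a single linear projection stands in for Image2Text, mean squared error stands in for the distillation objective, and the vectors are toy values rather than real CLIP embeddings.

```python
def project(weights, feature):
    """Hypothetical stand-in for Image2Text: a single linear projection."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]


def distillation_loss(student_feature, teacher_feature):
    """Mean squared error between the student output and the fixed teacher target."""
    n = len(teacher_feature)
    return sum((s - t) ** 2 for s, t in zip(student_feature, teacher_feature)) / n


def kd_step(weights, image_feature, teacher_feature, lr=0.1):
    """One gradient-descent step on the student weights; the teacher is never updated."""
    student = project(weights, image_feature)
    n = len(teacher_feature)
    # dL/dW[i][j] = (2 / n) * (student[i] - teacher[i]) * image_feature[j]
    return [
        [w - lr * (2.0 / n) * (s - t) * f for w, f in zip(row, image_feature)]
        for row, s, t in zip(weights, student, teacher_feature)
    ]


# Toy values only: a 3-d "image feature" and a 2-d "text feature" from the teacher.
image_feature = [1.0, 0.0, -1.0]
teacher_feature = [0.5, -0.5]
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_before = distillation_loss(project(weights, image_feature), teacher_feature)
weights = kd_step(weights, image_feature, teacher_feature)
loss_after = distillation_loss(project(weights, image_feature), teacher_feature)
print(loss_before, loss_after)
```

Because the teacher target is only needed to compute the loss, a setup like this uses the teacher during training only, which matches the summary's description of distillation as a training-time step.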

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: (a) Image-based group activity recognition; (b) Ours (image-text-based group activity recognition). Our method mines complementary information from the text to supplement image information and enhance group activity recognition. Deeper colors indicate stronger interaction with actor 0.
  • Figure 2: Overview of ActivityCLIP. The process indicated by the green arrow occurs only during the training stage. Here, we use Dual-AI as the image branch to show the process of ActivityCLIP. Dual-AI employs two paths with different spatial-temporal orders (spatial-temporal and temporal-spatial paths) for interaction relation modeling. S-Trans and T-Trans represent the spatial Transformer and temporal Transformer, respectively. More details on Dual-AI can be found in [han2022dual].
  • Figure 3: The structure of Image2Text. We indicate the feature's dimensional changes in each component.
  • Figure 4: Confusion matrix analysis of ASTFormer [li2022learningAction] and ACLIP(Dual-AI). Panels (a) and (b) show the confusion matrices of ASTFormer and ACLIP(Dual-AI), respectively, on the Volleyball dataset; panels (c) and (d) show the confusion matrices of the two methods, respectively, on the Collective Activity dataset. Here, 'r' and 'l' are short for 'right' and 'left.'
  • Figure 5: The influence of text information on each activity category. We conducted these experiments on the Volleyball dataset. 'ACLIP(ARG) - ARG' is the number of activities correctly identified by ACLIP(ARG) in each category minus the number correctly identified by ARG in the same category. 'ACLIP(AFormer) - AFormer' and 'ACLIP(Dual-AI) - Dual-AI' are defined in the same way. Here, 'r' and 'l' are short for 'right' and 'left.'
  • ...and 1 more figure
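The per-category difference described in the Figure 5 caption ('ACLIP(ARG) - ARG' and its analogues) can be computed as in the sketch below. The function names, the category strings, and the label and prediction lists are hypothetical toy values chosen for illustration. They are not results from the paper.

```python
from collections import Counter


def correct_per_category(labels, predictions):
    """Count correctly identified samples for each ground-truth category."""
    return Counter(y for y, p in zip(labels, predictions) if y == p)


def per_category_gain(labels, preds_with_text, preds_baseline):
    """Correct count with the text branch minus correct count of the baseline."""
    with_text = correct_per_category(labels, preds_with_text)
    baseline = correct_per_category(labels, preds_baseline)
    return {c: with_text[c] - baseline[c] for c in sorted(set(labels))}


# Toy labels and predictions, invented for illustration only.
labels = ["r_set", "r_set", "l_spike", "l_spike", "l_spike"]
preds_baseline = ["r_set", "l_spike", "l_spike", "r_set", "l_spike"]
preds_with_text = ["r_set", "r_set", "l_spike", "l_spike", "r_set"]

gain = per_category_gain(labels, preds_with_text, preds_baseline)
print(gain)
```

A positive value means the text-augmented model identified more samples of that category correctly than the image-only baseline, and a negative value means fewer.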