Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition
Masato Tamura
TL;DR
This work tackles social group activity recognition by replacing region-based features with transformer-generated social group features that aggregate context from whole frames. It introduces efficient designs for group queries and a divided self-attention mechanism that handle a large number of embeddings while keeping the assignment of embeddings to group members free of duplication. In experiments on the Volleyball and Collective Activity datasets, the approach achieves state-of-the-art performance in detection-based settings. The experiments also analyze query design, attention schemes, and the trade-offs between group size and member proximity. The proposed method is robust to variation in group size and member distribution and relies on RGB-only feature processing, with practical implications for surveillance, sports analytics, and social scene understanding. The key contribution is a scalable, context-rich transformer framework that identifies social group members and predicts group activities without heavy reliance on individual region features.
Abstract
Social group activity recognition is a challenging extension of group activity recognition in which social groups must be identified along with their activities and members. Existing methods tackle this task by leveraging region features of individuals, following existing group activity recognition methods. However, the effectiveness of region features is sensitive to the quality of person localization and to the variable semantics of individual actions. To overcome these issues, we propose leveraging attention modules in transformers to generate social group features. In this method, multiple embeddings are used to aggregate features for a social group, and each embedding is assigned to a group member without duplication. Because of this non-duplicated assignment, the number of embeddings must be large to avoid missing group members, which in turn renders attention in transformers ineffective. To find optimal attention designs for a large number of embeddings, we explore several design choices for the queries used in feature aggregation and for the self-attention modules in transformer decoders. Extensive experimental results show that the proposed method achieves state-of-the-art performance and verify that the proposed attention designs are highly effective for social group activity recognition.
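One way to see why a large number of embeddings strains self-attention is to count query-key pairs. The sketch below is illustrative only and is not taken from the paper. It assumes a hypothetical layout of `num_groups` group slots with `members_per_group` member embeddings each. It also assumes that "divided" self-attention means one pass within each group followed by one pass across groups, which the text above does not specify. The function names are invented for this example.

```python
# Illustrative pair counting for self-attention over member embeddings.
# ASSUMPTION: the layout (groups x members) and the intra/inter split are
# hypothetical; the source text does not define how attention is divided.


def full_attention_pairs(num_groups: int, members_per_group: int) -> int:
    """Query-key pairs when every embedding attends to every embedding."""
    total = num_groups * members_per_group
    return total * total


def divided_attention_pairs(num_groups: int, members_per_group: int) -> int:
    """Query-key pairs when attention is split into two passes.

    Pass 1 (intra-group): embeddings attend only within their own group.
    Pass 2 (inter-group): embeddings in the same member slot attend
    across groups.
    """
    intra = num_groups * members_per_group ** 2
    inter = members_per_group * num_groups ** 2
    return intra + inter


if __name__ == "__main__":
    for groups, members in [(10, 12), (30, 12)]:
        full = full_attention_pairs(groups, members)
        divided = divided_attention_pairs(groups, members)
        print(f"groups={groups} members={members} "
              f"full={full} divided={divided}")
```

Under these assumptions, full attention grows with the square of the total number of embeddings, while the divided variant grows with the sum of two smaller quadratic terms. This is the kind of saving a divided design would aim for.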
