Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition

Masato Tamura

TL;DR

This work tackles social group activity recognition by replacing region-based features of individuals with transformer-generated social group features that aggregate context from whole frames. It introduces efficient designs for group queries and a divided self-attention mechanism that handle a large number of embeddings while ensuring that each embedding is assigned to a group member without duplication. In experiments on the Volleyball and Collective Activity datasets, the approach achieves state-of-the-art performance in detection-based settings, and the analysis covers query design, attention schemes, and the effect of group size and member proximity on performance. The method is reported to be robust to group size and member distribution and uses RGB-only feature processing, with potential applications in surveillance, sports analytics, and social scene understanding. The key contribution is a scalable, context-rich transformer framework that identifies social group members and predicts group activities without heavy reliance on individual region features.

Abstract

Social group activity recognition is a challenging task extended from group activity recognition, in which social groups must be recognized together with their activities and group members. Existing methods tackle this task by leveraging region features of individuals, following existing group activity recognition methods. However, the effectiveness of region features is sensitive to person localization and to the variable semantics of individual actions. To overcome these issues, we propose leveraging attention modules in transformers to generate social group features. In this method, multiple embeddings are used to aggregate features for a social group, each of which is assigned to a group member without duplication. Because of this non-duplicated assignment, the number of embeddings must be large to avoid missing group members, which in turn renders attention in transformers ineffective. To find optimal attention designs for a large number of embeddings, we explore several design choices for the queries used in feature aggregation and for the self-attention modules in transformer decoders. Extensive experimental results show that the proposed method achieves state-of-the-art performance and verify that the proposed attention designs are highly effective for social group activity recognition.
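
The non-duplicated assignment mentioned in the abstract can be illustrated with a small sketch. This is not the paper's procedure: the greedy matching rule, the helper name `assign_without_duplication`, and the use of 2D points are all assumptions made for illustration. Each embedding is assumed to predict a location, and each person can be claimed by at most one embedding.

```python
from math import dist


def assign_without_duplication(predicted_points, person_points):
    """Greedily match each embedding's predicted point to a distinct person.

    Hypothetical helper for illustration only.
    Returns a dict {embedding_index: person_index}.
    """
    # Enumerate every (distance, embedding, person) candidate, closest first.
    candidates = sorted(
        (dist(p, q), e, m)
        for e, p in enumerate(predicted_points)
        for m, q in enumerate(person_points)
    )
    assignment, taken = {}, set()
    for _, e, m in candidates:
        # Accept a pair only if neither side has been used yet.
        if e not in assignment and m not in taken:
            assignment[e] = m
            taken.add(m)
    return assignment


# Three embeddings and three people (hypothetical coordinates).
embeddings = [(0.1, 0.1), (0.12, 0.1), (0.9, 0.9)]
people = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
result = assign_without_duplication(embeddings, people)
```

In this sketch, the second embedding cannot reuse the first person even though that person is closest, so it falls back to the remaining one. With only two embeddings, one of the three people would stay unmatched, which is the reason the abstract gives for needing a large number of embeddings.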

Paper Structure (27 sections, 3 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Overview of existing and proposed social group activity recognition methods. Existing methods first extract region features for individuals and then split them into social groups, while the proposed method extracts social group features to recognize social group activities and identify group members.
  • Figure 2: Example case where scene context is essential for activity recognition. In the Volleyball dataset, the trajectory of the ball is an important clue for recognizing group activities. The proposed method attends strongly to the region above the players, where the ball typically passes.
  • Figure 3: Overall architecture of the proposed method.
  • Figure 4: Several design choices for the self-attention module in the transformer decoder. The colors of the embeddings indicate embedding sets. Only embeddings in the same set interact with each other through the attention mechanism. Embeddings shown in white are not fed into the attention module; in the divided attention implementation, this means that only one embedding in each group is fed into the first self-attention module. Residual connections are omitted in this figure.
  • Figure 5: Performance by group size and by maximum distance between group members. The color intensity reflects the value in each cell.
  • ...and 1 more figures
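
The Figure 4 caption states that only embeddings in the same set interact through attention. A minimal sketch of that restriction as a boolean mask follows. The layout used here (two groups of three embeddings, one set per group) and the name `divided_attention_mask` are assumptions for illustration; the figure describes several design choices, and this excerpt does not specify which embeddings form each set.

```python
def divided_attention_mask(set_ids):
    """Return mask[i][j] == True when embedding i may attend to embedding j.

    Hypothetical helper: two embeddings interact only if they share a set id.
    """
    return [[a == b for b in set_ids] for a in set_ids]


# Assumed layout: 2 groups x 3 embeddings, one embedding set per group.
set_ids = [0, 0, 0, 1, 1, 1]
mask = divided_attention_mask(set_ids)

# Count query-key pairs that are allowed versus full self-attention.
allowed_pairs = sum(sum(row) for row in mask)
full_pairs = len(set_ids) ** 2
```

Under this assumed layout, the mask is block-diagonal, so 18 of the 36 query-key pairs remain. The fraction of retained pairs shrinks further as the number of sets grows.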