
Towards More Practical Group Activity Detection: A New Benchmark and Model

Dongkeun Kim, Youngkil Song, Minsu Cho, Suha Kwak

TL;DR

This work targets practical group activity detection (GAD): identifying the members of each group in a video and labeling each group's activity. It introduces Café, a large-scale, multi-view GAD dataset with rich annotations, and a Transformer-based GAD model that uses learnable group tokens to handle an unknown number of groups and latent members without clustering. Training employs four losses, including a group consistency objective, and relies on bipartite matching with the Hungarian algorithm and on an affinity mechanism between group embeddings and actor embeddings, which enables end-to-end optimization. Empirical results show that the proposed method outperforms prior approaches on Café and other benchmarks in both accuracy (Group mAP) and inference speed. Group IoU is used to assess group localization and is defined as $\text{Group IoU}(G,\hat{G})=\frac{|G\cap\hat{G}|}{|G\cup\hat{G}|}$. Outlier mIoU evaluates the detection of singleton outliers with a corresponding IoU-based metric, $\text{Outlier mIoU}=\frac{1}{|V|}\sum_{v\in V}\frac{|O_v\cap\hat{O}_v|}{|O_v\cup\hat{O}_v|}$. The work presents a practical benchmark and a scalable, accurate model with direct implications for surveillance, social scene understanding, and crowd analytics.
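To make the two IoU-based metrics above concrete, here is a minimal Python sketch that evaluates both formulas on sets of actor IDs. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `group_iou` and `outlier_miou` are hypothetical, $V$ is assumed to be the set of evaluated videos, $O_v$ and $\hat{O}_v$ are assumed to be the ground-truth and predicted outlier sets of video $v$, and the convention that two empty sets score 1.0 is our own choice.

```python
from typing import Dict, Hashable, Set


def group_iou(true_members: Set[Hashable], pred_members: Set[Hashable]) -> float:
    """Return |G ∩ Ĝ| / |G ∪ Ĝ| for two sets of actor IDs."""
    union = true_members | pred_members
    if not union:
        # Assumption (not from the source): two empty sets count as a perfect match.
        return 1.0
    return len(true_members & pred_members) / len(union)


def outlier_miou(
    true_outliers: Dict[str, Set[Hashable]],
    pred_outliers: Dict[str, Set[Hashable]],
) -> float:
    """Average the per-video outlier IoU over all videos in `true_outliers`."""
    if not true_outliers:
        raise ValueError("at least one video is required")
    total = 0.0
    for video_id, truth in true_outliers.items():
        # A video with no prediction is treated as predicting no outliers.
        total += group_iou(truth, pred_outliers.get(video_id, set()))
    return total / len(true_outliers)


if __name__ == "__main__":
    # Ground truth {1, 2, 3} vs. prediction {2, 3, 4}: 2 shared actors, 4 in the union.
    print(group_iou({1, 2, 3}, {2, 3, 4}))  # 0.5

    truth = {"video_a": {5}, "video_b": {6, 7}}
    pred = {"video_a": {5}, "video_b": {7}}
    # Per-video IoUs are 1.0 and 0.5, so the mean is 0.75.
    print(outlier_miou(truth, pred))  # 0.75
```

Representing groups as Python sets keeps the code a direct transcription of the set notation in the formulas; a real evaluation pipeline would also need to decide how predicted groups are paired with ground-truth groups, which this sketch leaves out.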

Abstract

Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed.

Paper Structure (34 sections, 12 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Examples of videos in Café. The videos were taken at six different places, each with four cameras at different viewpoints.
  • Figure 2: Summary statistics of Café. (a) Group population versus group size per activity class. (b) Distribution of the number of actors in each video frame.
  • Figure 3: Comparison between Café and existing GAD datasets in terms of (a) group size, (b) aspect ratios of actor boxes, (c) population density, and (d) inter-group distance.
  • Figure 3: Quantitative results on the JRDB-Act validation set.
  • Figure 4: (Left) Overall architecture of our model. (Right) Detailed architecture of the Grouping Transformer.
  • ...and 4 more figures