SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

Tao Wu, Runyu He, Gangshan Wu, Limin Wang

TL;DR

This paper introduces SportsHHI, a dataset and task for video human-human interaction detection in sports, addressing the lack of high-level multi-person interaction definitions in existing datasets. It defines 34 interaction classes across basketball and volleyball and provides richly annotated keyframes with extensive human bounding boxes and interaction instances to facilitate spatio-temporal reasoning. A two-stage baseline leveraging motion, context, and relative-position cues, plus information exchange among proposals, demonstrates the importance of temporal dynamics and detailed action modeling, with VideoMAE-based backbones yielding notable gains. The work positions SportsHHI as a benchmark to spur advances in explicit human-human interaction understanding in videos and the development of robust spatio-temporal context modeling techniques.

Abstract

Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can often be recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task, video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. In total, 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.

Paper Structure (15 sections, 3 equations, 11 figures, 11 tables)


Figures (11)

  • Figure 1: Comparison between previous video visual relation detection datasets and our SportsHHI. In the upper row, we show three relation instances from the VidVRD and AG datasets. These datasets rarely involve human-human interaction and define semantically simple relations that can be recognized by appearance or prior information. In contrast, the bottom row shows interaction annotations in two sample keyframes of SportsHHI. The bounding boxes and interaction annotations of the same instance are displayed in the same color. SportsHHI provides complex multi-person scenes where various interactions between human pairs occur concurrently. It focuses on high-level interactions that require detailed spatio-temporal context reasoning.
  • Figure 2: User interface for interaction annotation. The person bounding boxes and IDs in the keyframe are shown on the left. We can play the video for context information. To add an interaction instance in the current keyframe, the subject person ID, object person ID, and interaction class should be specified.
  • Figure 3: Interaction class hierarchy. SportsHHI has 34 interaction classes with high-level semantics in total: 16 for basketball and 18 for volleyball.
  • Figure 4: The number of interaction instances of each class, sorted in descending order.
  • Figure 5: Statistics comparisons between SportsHHI and VidVRD. On the left, we compare the distribution of the number of instances in each keyframe. SportsHHI has more keyframes with fewer instances because of the high-level interaction class definition and the properties of sports videos. On the right, we compare the distribution of GIoU between the subject and object. The proportions of instances with extremely high and extremely low GIoU between subject and object are both higher than in VidVRD,
  • ...and 6 more figures
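
The Figure 5 caption compares the GIoU (generalized intersection over union) between subject and object boxes. For readers unfamiliar with the measure, the following is a minimal Python sketch of the standard GIoU definition, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates. The `Box` alias and the function names `box_area` and `giou` are illustrative assumptions and do not come from the paper or its code.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU of two boxes, in the range (-1, 1]."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    union = box_area(a) + box_area(b) - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(a[2], b[2]) - min(a[0], b[0])
    enclose_h = max(a[3], b[3]) - min(a[1], b[1])
    enclose = enclose_w * enclose_h

    if union <= 0.0 or enclose <= 0.0:
        raise ValueError("boxes must have positive area")

    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (enclose - union) / enclose


if __name__ == "__main__":
    subject = (0.0, 0.0, 2.0, 2.0)
    print(giou(subject, (0.0, 0.0, 2.0, 2.0)))  # identical boxes
    print(giou(subject, (1.0, 1.0, 3.0, 3.0)))  # partial overlap
    print(giou(subject, (8.0, 0.0, 10.0, 2.0)))  # far apart
```

Unlike plain IoU, which is zero for every non-overlapping pair, GIoU keeps decreasing toward -1 as two boxes move apart, so it can separate distant subject-object pairs from adjacent ones as well as measure heavy overlap.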