SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
Tao Wu, Runyu He, Gangshan Wu, Limin Wang
TL;DR
This paper introduces SportsHHI, a dataset and task for video human-human interaction detection in sports, addressing the lack of high-level multi-person interaction definitions in existing datasets. It defines 34 interaction classes across basketball and volleyball and provides richly annotated keyframes with extensive human bounding boxes and interaction instances to facilitate spatio-temporal reasoning. A two-stage baseline leveraging motion, context, and relative-position cues, plus information exchange among proposals, demonstrates the importance of temporal dynamics and detailed action modeling, with VideoMAE-based backbones yielding notable gains. The work positions SportsHHI as a benchmark to spur advances in explicit human-human interaction understanding in videos and the development of robust spatio-temporal context modeling techniques.
Abstract
Video-based visual relation detection tasks, such as video scene graph generation, play an important role in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder progress in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, their relation types have relatively low-level semantics and can often be recognized from appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task, video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. In total, 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
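To give a rough sense of annotation density, the dataset totals quoted above can be turned into per-keyframe averages. The short Python sketch below uses only the three counts stated in the abstract; the variable and function names are illustrative, and the averages say nothing about how boxes or interactions are distributed across individual keyframes, sports, or classes.

```python
# Totals as stated in the abstract.
NUM_BOXES = 118_075         # annotated human bounding boxes
NUM_INTERACTIONS = 50_649   # annotated interaction instances
NUM_KEYFRAMES = 11_398      # annotated keyframes


def average_per_keyframe(total: int, keyframes: int) -> float:
    """Return the mean number of annotations per keyframe."""
    if keyframes <= 0:
        raise ValueError("keyframes must be positive")
    return total / keyframes


boxes_per_keyframe = average_per_keyframe(NUM_BOXES, NUM_KEYFRAMES)
interactions_per_keyframe = average_per_keyframe(NUM_INTERACTIONS, NUM_KEYFRAMES)

print(f"boxes per keyframe:        {boxes_per_keyframe:.2f}")         # 10.36
print(f"interactions per keyframe: {interactions_per_keyframe:.2f}")  # 4.44
```

On average, then, a keyframe carries roughly ten annotated people and between four and five interaction instances, which is consistent with the multi-person setting the abstract describes.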
