LOGO: A Long-Form Video Dataset for Group Action Quality Assessment

Shiyi Zhang, Wenxun Dai, Sujia Wang, Xiangwei Shen, Jiwen Lu, Jie Zhou, Yansong Tang

TL;DR

LOGO tackles action quality assessment in multi-person, long-form videos, addressing limitations of prior datasets that focus on single-person, short-duration actions. The authors introduce LOGO with 200 long-form artistic-swimming videos, 8 athletes per sample, and rich annotations for actions and formations, enabling group-aware modeling. They propose GOAT, a plug-and-play group-aware attention module built from a group-aware GCN and temporal fusion, to capture inter-actor relations and long-term temporal structure. Experimental results show GOAT yields state-of-the-art performance on LOGO and generalizes to other AQA and action-segmentation tasks, highlighting the value of explicit group information in AQA.

Abstract

Action quality assessment (AQA) has become an emerging topic since it can be extensively applied in numerous scenarios. However, most existing methods and datasets focus on single-person short-sequence scenes, hindering the application of AQA in more complex situations. To address this issue, we construct a new multi-person long-form video dataset for action quality assessment named LOGO. Distinguished in scenario complexity, our dataset contains 200 videos from 26 artistic swimming events with 8 athletes in each sample along with an average duration of 204.2 seconds. As for richness in annotations, LOGO includes formation labels to depict group information of multiple athletes and detailed annotations on action procedures. Furthermore, we propose a simple yet effective method to model relations among athletes and reason about the potential temporal logic in long-form videos. Specifically, we design a group-aware attention module, which can be easily plugged into existing AQA methods, to enrich the clip-wise representations based on contextual group information. To benchmark LOGO, we systematically conduct investigations on the performance of several popular methods in AQA and action segmentation. The results reveal the challenges our dataset brings. Extensive experiments also show that our approach achieves state-of-the-art on the LOGO dataset. The dataset and code will be released at https://github.com/shiyi-zh0408/LOGO.

Paper Structure (17 sections, 6 equations, 6 figures, 6 tables)

This paper contains 17 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: An overview of the LOGO dataset. LOGO is a multi-person long-form video dataset with frame-wise annotations on both action procedures (as shown in the second line) and formations (as shown in the third line, which reflects relations among actors) based on artistic swimming scenarios. It provides a potential for constructing an action quality assessment approach with the ability to model group information among actors. Longer video durations also challenge the ability of the method to aggregate long-term temporal information.
  • Figure 2: A tree structure of LOGO Taxonomy. LOGO organizes both the Actions and Formations annotations hierarchically. The left part shows the Actions categories of Technical and Free events. The right part depicts the formation annotation instances when the group is doing Required, Upper, Lower or Float actions (the right sub-tree of Actions) and not when the group is doing other actions, during which the formations are indistinguishable.
  • Figure 3: Statistics of LOGO. (a) The score distribution of videos. (b) The action-type distribution of frames.
  • Figure 4: An overview of our group-aware approach for action quality assessment. First, we divide the video into several short clips of equal length. For each clip, we take the middle frame and perform object detection to get the bounding boxes of the actors. We then send the frame into a CNN to extract the features of the actors, and use the feature vector of each actor as a node to construct the relation graph. We use a Graph Convolutional Network to enhance the features in the graph and send the output features into GOAT as "queries" and "keys". The "values" are the clip features obtained from a video feature backbone such as I3D or Video Swin Transformer. This completes the aggregation over time. Finally, the output features are sent into the assessment network to predict the scores.
  • Figure 5: Visualization of the output of our proposed GOAT in action quality assessment. Red stars denote clips with high weight, and blue stars denote clips with low weight. Our approach focuses on clips where the athletes perform effective movements with clear formations, and it ignores redundant parts such as those where all actors are underwater.
  • ...and 1 more figures
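
The pipeline described in the Figure 4 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the similarity-based adjacency, the single GCN layer, mean pooling over actors, the single attention head, and every name here (`group_aware_attention`, `w_gcn`, `w_q`, `w_k`, `w_v`) are assumptions made for the example. Only the overall flow comes from the caption: per-actor features form a relation graph, a GCN enhances them, the result supplies the queries and keys, and clip features from a video backbone supply the values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_aware_attention(actor_feats, clip_feats, w_gcn, w_q, w_k, w_v):
    """Hypothetical sketch of group-aware temporal attention.

    actor_feats: (T, N, D) features of N actors for each of T clips
                 (e.g. taken from the middle frame of each clip).
    clip_feats:  (T, C) clip features from a video backbone.
    Returns fused clip features (T, A) and attention weights (T, T).
    """
    T, N, D = actor_feats.shape
    # Relation graph (assumption): row-normalised pairwise similarity
    # between actors within each clip.
    sim = actor_feats @ actor_feats.transpose(0, 2, 1) / np.sqrt(D)  # (T, N, N)
    adj = softmax(sim, axis=-1)
    # One GCN layer: aggregate neighbours, project, apply ReLU.
    enhanced = np.maximum(adj @ actor_feats @ w_gcn, 0.0)            # (T, N, H)
    # Pool actors into one group feature per clip (assumption: mean).
    group = enhanced.mean(axis=1)                                    # (T, H)
    # Queries and keys come from group features, values from clip features.
    q, k, v = group @ w_q, group @ w_k, clip_feats @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)       # (T, T)
    return weights @ v, weights

# Toy usage with random inputs: 6 clips, 8 actors per clip.
rng = np.random.default_rng(0)
T, N, D, C, H, A = 6, 8, 16, 32, 12, 10
fused, weights = group_aware_attention(
    rng.normal(size=(T, N, D)),
    rng.normal(size=(T, C)),
    rng.normal(size=(D, H)),
    rng.normal(size=(H, A)),
    rng.normal(size=(H, A)),
    rng.normal(size=(C, A)),
)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to one and says how strongly one clip draws on every other clip, which is the kind of per-clip weighting that Figure 5 visualizes. In a real model the projection matrices would be learned jointly with the assessment network rather than sampled at random.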