Multimodal Action Quality Assessment

Ling-An Zeng, Wei-Shi Zheng

TL;DR

This work tackles action quality assessment (AQA) by introducing PAMFN, a multimodal network that models RGB, optical flow, and audio in separate modality-specific branches and progressively aggregates them in a mixed-modality branch. The approach combines a Modality-specific Feature Decoder, an Adaptive Fusion Module with ranked FusionNets and a PolicyNet, and a Cross-modal Feature Decoder to selectively transfer information across modalities, and it is trained in two stages. Empirical results on the Rhythmic Gymnastics and Fis-V datasets show state-of-the-art performance, with substantial gains over prior visual-only AQA methods and competitive results against multimodal baselines; ablations confirm the value of each component and of the adaptive fusion policy. The architecture also generalizes to highlight detection, which suggests practical utility beyond AQA and potential for broader multimodal video understanding tasks.

Abstract

Action quality assessment (AQA) aims to assess how well an action is performed. Previous works model only visual information and ignore audio information. We argue that although AQA is highly dependent on visual information, audio is useful complementary information for improving score regression accuracy, especially for sports with background music, such as figure skating and rhythmic gymnastics. To leverage multimodal information for AQA, i.e., RGB, optical flow and audio information, we propose a Progressive Adaptive Multimodal Fusion Network (PAMFN) that separately models modality-specific information and mixed-modality information. Our model consists of three modality-specific branches that independently explore modality-specific information and a mixed-modality branch that progressively aggregates the modality-specific information from the modality-specific branches. To bridge the modality-specific branches and the mixed-modality branch, three novel modules are proposed. First, a Modality-specific Feature Decoder module is designed to selectively transfer modality-specific information to the mixed-modality branch. Second, when exploring the interaction between modality-specific information, we argue that using an invariant multimodal fusion policy may lead to suboptimal results, because it does not take into consideration the potential diversity in different parts of an action. Therefore, an Adaptive Fusion Module is proposed to learn adaptive multimodal fusion policies in different parts of an action. This module consists of several FusionNets for exploring different multimodal fusion strategies and a PolicyNet for deciding which FusionNets are enabled. Third, a module called Cross-modal Feature Decoder is designed to transfer cross-modal features generated by the Adaptive Fusion Module to the mixed-modality branch.
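
To make the adaptive fusion idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each FusionNet is a single linear layer with ReLU over concatenated RGB, flow and audio features, and that the PolicyNet produces one logit per FusionNet that is thresholded into a binary enable mask. The names `make_fusion_net`, `policy_net`, `adaptive_fusion` and all dimensions are hypothetical. Hard thresholding as shown is only meaningful at inference time; training such a policy would need a differentiable relaxation, which is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_fusion_net(in_dim, out_dim):
    """Hypothetical FusionNet: one linear layer + ReLU with its own weights."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.maximum(x @ w, 0.0)

def policy_net(x, w_policy):
    """Hypothetical PolicyNet: one logit per FusionNet, thresholded to 0/1."""
    logits = x @ w_policy                  # (T, K)
    return (logits > 0).astype(x.dtype)    # binary enable mask per time step

def adaptive_fusion(rgb, flow, audio, fusion_nets, w_policy):
    """Run all K FusionNets, then zero out the ones the policy disables."""
    x = np.concatenate([rgb, flow, audio], axis=-1)         # (T, 3D)
    mask = policy_net(x, w_policy)                          # (T, K)
    fused = np.stack([f(x) for f in fusion_nets], axis=1)   # (T, K, D)
    return fused * mask[..., None], mask

# Toy example: T time steps, D feature dims per modality, K FusionNets.
T, D, K = 4, 8, 3
rgb, flow, audio = (rng.standard_normal((T, D)) for _ in range(3))
fusion_nets = [make_fusion_net(3 * D, D) for _ in range(K)]  # no shared weights
w_policy = rng.standard_normal((3 * D, K))

cross_modal, mask = adaptive_fusion(rgb, flow, audio, fusion_nets, w_policy)
print(cross_modal.shape, mask.shape)
```

Because the mask is computed per time step, different parts of the same action can enable different subsets of FusionNets, which is the behavior an invariant fusion policy cannot express.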

Paper Structure (31 sections, 13 equations, 9 figures, 9 tables)


Figures (9)

  • Figure 1: The overall architecture of our proposed PAMFN. The RGB, optical flow and audio inputs are fed into three pretrained backbones to extract features. Three modality-specific branches with the same structure are independently pretrained to explore modality-specific information. Then, a mixed-modality branch progressively aggregates the modality-specific information via a Modality-specific Feature Decoder (MSFD) module, and the cross-modal information via an Adaptive Fusion Module (AFM) and a Cross-modal Feature Decoder (CMFD) module. The Adaptive Fusion Module explores an adaptive cross-modal fusion policy. Our network is trained in two phases. We first train the modality-specific branches separately. We then freeze the modality-specific branches except the audio branch, since the quality of an action is almost impossible to assess using only audio information, and train the mixed-modality branch, MSFD, AFM and CMFD. Note that the mixed-modality features are initialized to zeros.
  • Figure 2: Illustration of the Modality-specific Feature Decoder module. $\otimes$ denotes matrix multiplication. The shapes of important tensors are shown in the figure. Query, Key and Value are three different linear projections.
  • Figure 3: Illustration of the Adaptive Fusion Module. The $K$ different FusionNets do not share parameters and explore different fusion policies. The PolicyNet generates a decision $a_{i,t}$ that determines which fusion strategies are enabled. $M_{i,t}$ is a binary mask vector that masks out the cross-modal features that are not enabled.
  • Figure 4: Illustration of the Cross-modal Feature Decoder module. $\otimes$ denotes matrix multiplication. $\oplus$ denotes the element-wise sum. The shapes of important tensors are shown in the figure. Query, Key and Value are three different linear projections.
  • Figure 5: The frequency of the generated decisions at three stages on the RG dataset. The line labeled Uniform denotes the frequency of decisions under a uniform distribution. Best viewed in color.
  • ...and 4 more figures
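
The decoder captions above mention Query, Key and Value projections combined by matrix multiplication, with an element-wise sum in the cross-modal case. A minimal NumPy sketch of one way such a decoder could work is given below. It assumes standard scaled dot-product attention in which the mixed-modality features form the Query and the source features (modality-specific or cross-modal) form the Key and Value; the softmax, the scaling factor, the residual sum and the function name `feature_decoder` are assumptions made for illustration, not details confirmed by the captions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_decoder(mixed, source, w_q, w_k, w_v, residual=True):
    """Hypothetical attention-style decoder.

    mixed:  (T, D) mixed-modality features, used as the Query input.
    source: (S, D) features to transfer from, used as Key and Value input.
    """
    q = mixed @ w_q                                   # (T, D)
    k = source @ w_k                                  # (S, D)
    v = source @ w_v                                  # (S, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T, S), rows sum to 1
    out = attn @ v                                    # (T, D)
    # Assumed element-wise sum with the incoming mixed-modality features.
    return (mixed + out if residual else out), attn

rng = np.random.default_rng(0)
T, S, D = 5, 7, 16
mixed = np.zeros((T, D))                 # zero-initialized mixed-modality features
source = rng.standard_normal((S, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

updated, attn = feature_decoder(mixed, source, w_q, w_k, w_v)
print(updated.shape, attn[0])
```

Under these assumptions, zero-initialized mixed-modality features give all-zero queries, so the first decoding step attends uniformly over the source features and simply averages their Value projections; later steps can become selective once the mixed-modality features carry information.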