Narrative Action Evaluation with Prompt-Guided Multimodal Interaction

Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang

TL;DR

This work defines Narrative Action Evaluation (NAE), a task that couples rich narrative commentary with professional action assessment. It introduces a Prompt-Guided Multimodal Interaction framework that aligns video and language modalities through context-aware prompts, score-guided token learning, and a multimodal-aware text generator, reframing score prediction as a video-text matching problem. To support research, the authors re-annotate the MTL-AQA and FineGym datasets with high-quality narration and establish benchmarks, releasing code and data. Empirically, the approach outperforms prior methods on NAE and related AQA and captioning metrics, demonstrating the value of inter-modal interaction and learnable templating for professional narrative evaluation.

Abstract

In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning, which produce only scores or superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One possible solution is multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach reduces performance on the individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework uses a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experimental results show that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at https://github.com/shiyi-zh0408/NAE_CVPR2024.
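
The idea of recasting score regression as video-text matching can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the `temperature` value, the expected-score readout, and the toy embeddings are all assumptions made for illustration. The only elements taken from the paper are that K class prompts are embedded as text and matched against the video representation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def score_by_matching(video_emb, prompt_embs, class_scores, temperature=0.1):
    """Match one video embedding against K prompt embeddings.

    Returns the matching distribution over the K score classes and a
    scalar score read out as the probability-weighted class score
    (the readout is an assumption, not the paper's formulation).
    """
    sims = [cosine(video_emb, p) / temperature for p in prompt_embs]
    probs = softmax(sims)
    expected = sum(p * s for p, s in zip(probs, class_scores))
    return probs, expected

# Hypothetical 2-D embeddings for K = 3 score classes (low, mid, high).
prompt_embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
class_scores = [20.0, 50.0, 80.0]   # representative score per class (assumed)
video_emb = [0.1, 0.9]              # hand-made video embedding

probs, expected = score_by_matching(video_emb, prompt_embs, class_scores)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"best class: {best}, expected score: {expected:.2f}")
```

Because the score is now selected by matching against text embeddings, the scoring and narration branches can share a language-side representation, which is the kind of task interactivity the abstract describes.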

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: A comparison of our proposed narrative action evaluation (NAE) task with action quality assessment (AQA) and video captioning. The three rows in the figure show the input video, the outputs of the three tasks, and the information contained in each output. Compared with AQA, NAE provides rich language descriptions. Compared with video captioning, NAE includes much more evaluation information, such as scores, actions, and qualitative evaluations, which are often rigorous and granular. In general, NAE aims to strike a balance between the professionalism of assessment information and the richness of language. This duality is both the defining characteristic and the main challenge of NAE.
  • Figure 2: The process of re-annotating a sample using ChatGPT, based on existing action and score labels. Convert2sentence(act, score) inserts the action and score information into a fixed template to generate a complete sentence.
  • Figure 3: The left part shows an overview of our Prompt-Guided Multimodal Interaction paradigm. First, we send the K-class Prompts into the text encoder to get K-class Prompt Embeddings, and then perform Context-Aware Prompt Learning on them using the video features through the Context-Aware Transformer. Second, in Score-Guided Tokens Learning, the video embeddings from the video encoder interact with the K-class Prompts mentioned above through the Score-Aware Transformer. Third, we use the Multimodal-Aware Text Generator with the Tri-Token Attention Mask to integrate the multimodal tokens from Score-Guided Tokens Learning and generate the text. The upper right part shows the Tri-Token Attention Mask, and the bottom right part shows the learnable template in the Multimodal-Aware Text Generator.
  • Figure 4: Qualitative results. Our model can generate detailed narrations including scores, actions, and qualitative evaluations to describe and evaluate the actions comprehensively. Notably, the model can analyze the quality of actions by pointing out the details of the execution.
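
The Convert2sentence(act, score) step named in the Figure 2 caption can be sketched as a simple template fill. The template wording, the one-decimal score format, and the example action label below are hypothetical; the caption states only that a fixed template receives the action and score and yields a complete sentence.

```python
def convert2sentence(act: str, score: float) -> str:
    """Insert an action label and its score into a fixed sentence template.

    The template text is a placeholder chosen for illustration, not the
    wording used by the authors.
    """
    return f"The athlete performed {act} and received a score of {score:.1f}."

# Hypothetical action label and score for one sample.
sentence = convert2sentence("a forward 2.5 somersaults dive in the pike position", 86.4)
print(sentence)
```

A sentence produced this way carries the existing labels in plain language, so it can be passed along with them to the language model during re-annotation.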