Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Yuxiao Chen, Kai Li, Wentao Bao, Deep Patel, Yu Kong, Martin Renqiang Min, Dimitris N. Metaxas

TL;DR

This work uses an LLM to filter out task-irrelevant information from narrations and summarize task-related procedure steps (LLM-steps), and proposes the Multi-Pathway Text-Video Alignment (MPTVA) strategy to generate reliable pseudo step-video matching. The resulting approach surpasses state-of-the-art methods in three downstream tasks.

Abstract

Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for alignment noise, i.e., narrations that are irrelevant to the instructional task in the video and narrations with unreliable timestamps. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from the different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding, by 5.9%, 3.1%, and 2.8%, respectively.
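The multi-pathway idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the pathway scores are computed or fused, so everything below is an assumption made for illustration: the function names `step_narration_video` and `fuse_pathways`, the softmax routing through narrations, the mean fusion, and the fixed threshold are all hypothetical choices, not the authors' stated method. Each matrix holds one score per (LLM-step, video segment) pair.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def step_narration_video(step_narr_sim: Matrix, narr_video: Matrix) -> Matrix:
    """Pathway (1), as assumed here: route each step to video segments
    through the narrations.

    step_narr_sim[s][k]: similarity between LLM-step s and narration k.
    narr_video[k][t]: 1.0 if narration k's timestamps cover segment t, else 0.0.
    """
    n_narr, n_seg = len(narr_video), len(narr_video[0])
    routed = []
    for row in step_narr_sim:
        weights = softmax(row)  # how strongly the step matches each narration
        routed.append([
            sum(weights[k] * narr_video[k][t] for k in range(n_narr))
            for t in range(n_seg)
        ])
    return routed


def fuse_pathways(pathways: List[Matrix], threshold: float) -> Tuple[Matrix, List[List[bool]]]:
    """Fuse per-pathway scores by averaging (an assumption) and mark
    step-segment pairs whose fused score reaches the threshold as
    pseudo-matches."""
    n_steps, n_seg = len(pathways[0]), len(pathways[0][0])
    fused = [
        [sum(p[s][t] for p in pathways) / len(pathways) for t in range(n_seg)]
        for s in range(n_steps)
    ]
    pseudo = [[score >= threshold for score in row] for row in fused]
    return fused, pseudo
```

In this sketch, a segment is kept as a pseudo-match for a step only when the pathways agree on average, which is one simple way to reduce the effect of a single noisy pathway such as unreliable narration timestamps.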

Paper Structure (28 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Illustration of different text information in narrated instruction videos. The orange and green bars along the time (t) axis denote the temporal boundaries of procedure steps and timestamps of narrations, respectively. Sentences highlighted in green indicate task-relevant information, while those in red are task-irrelevant.
  • Figure 1: Comparison of different types of text information associated with the instruction video. The sentences highlighted in red are irrelevant to the tasks demonstrated in the video.
  • Figure 2: Overview of our proposed method. (a): We first use an LLM to summarize task-relevant LLM-steps from narrations. (b): We then extract the pseudo-matching between LLM-steps and video segments using our proposed MPTVA. (c): The extracted pseudo-alignments are used as supervision to train the model to minimize the MIL-NCE loss. (d): The illustration of the proposed MPTVA strategy for pseudo-label generation.
  • Figure 2: Comparison of different types of text information associated with the instruction video. The sentences highlighted in red are irrelevant to the tasks demonstrated in the video.
  • Figure 3: Comparison of different types of text information associated with the instruction video. The sentences highlighted in red are irrelevant to the tasks demonstrated in the video.
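The Figure 2 caption states that the pseudo-alignments supervise training with the MIL-NCE loss. As a rough illustration, the sketch below computes a MIL-NCE-style loss for a single LLM-step: the negative log of the softmax mass assigned to the segments marked positive by the pseudo-labels. It is a simplified sketch under stated assumptions: the function name `mil_nce_loss` is hypothetical, negatives are taken to be the remaining segments of the same video, and the temperature default is arbitrary. The paper's exact formulation and negative sampling are not given in this section.

```python
import math
from typing import List


def mil_nce_loss(scores: List[float], positive: List[bool], temperature: float = 1.0) -> float:
    """MIL-NCE-style loss for one step.

    scores[t]: similarity between the step and video segment t.
    positive[t]: True if segment t is a pseudo-match for the step.
    Returns -log(sum of exp over positives / sum of exp over all segments).
    """
    if not any(positive):
        raise ValueError("at least one positive segment is required")
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos_mass = sum(e for e, is_pos in zip(exps, positive) if is_pos)
    return -math.log(pos_mass / sum(exps))
```

Because the positives are summed inside the logarithm, the loss is low as long as at least one pseudo-matched segment scores highly, which tolerates some noise in the pseudo-labels.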