
Detours for Navigating Instructional Videos

Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman

TL;DR

The paper tackles the challenge of navigating instructional videos by introducing the video detours problem: a user watching a source video asks a natural language query to modify the current task, and the system must return a detour video together with the time window that satisfies the request. It proposes VidDetours, a two-stage video-language model that retrieves a detour video and localizes the relevant segment, conditioned on the source video context and the natural language query. Training data comes from a weakly supervised generation pipeline based on HowTo100M narrations and LLMs. On a large-scale gold-standard test set (4K videos, 16K questions), the model outperforms state-of-the-art video retrieval and localization baselines, with significant gains in recall and temporal localization. The work lays the groundwork for an interconnected how-to knowledge base and provides a benchmark for future research in personalized, query-driven navigation of instructional videos.

Abstract

We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools, and techniques. Validating on a ground-truth annotated dataset of 16K samples, we show our model's significant improvements over the best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.

Paper Structure (19 sections, 3 equations, 9 figures, 4 tables)

This paper contains 19 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example video detour. In the Chicken Quesadillas recipe, the source video $V_s$ (top) shows the use of an electric grill at time instant $t_s$. A user watching this video does not have a grill and asks a query $\mathcal{Q}$, "how to do this without an electric grill?". In response, the system identifies a detour video $V_d$ and timepoint $T_d$ showing a similar recipe but using a heating pan instead of a grill.
  • Figure 1: Weakly-supervised summaries generated using narrations with LLAMA 2. While the majority of the outputs contain step details and timestamps in the desired format, a few outputs are incorrect (bottom).
  • Figure 2: Overview of the detours dataset ($\mathcal{D}_D^{tr}$) curation. Given unlabeled instructional videos for training (we use HowTo100M), we first input their narrations with timestamps to a language model (LLAMA 2) to obtain summaries of their steps. Next, we automatically select pairs of similar summaries along with their timestamps and use a language model to generate weakly-supervised detours annotation tuples $(V_s, t_s, \mathcal{Q}, V_d, T_d)$. As an example, the source video here uses smooth peanut butter. A possible detour question is "can I use chunky peanut butter here?" and the window at $T_d$ in the detour video (top right, orange) shows the use of crunchy peanut butter.
  • Figure 2: Weakly-supervised detour annotation samples for training and validation. The figure also contains a row of failure cases with reasons. Please also see the attached visualization for more examples.
  • Figure 3: Visualization of the most frequent bigrams of the queries in the manually annotated test set. We see that most of the queries have little or no context about the current recipe and the step being executed, e.g. "how do I plate this differently?" or "can I do this step using a spatula?", emphasizing the need for source video context, as we explore in the proposed model.
  • ...and 4 more figures
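
The annotation tuple $(V_s, t_s, \mathcal{Q}, V_d, T_d)$ from the dataset-curation caption above can be pictured as a small record type. The following is only an illustrative sketch, not the authors' code: the class name, field names, video identifiers, and timestamps are all hypothetical, and representing $T_d$ as a (start, end) pair in seconds is an assumption. The `temporal_iou` helper shows a common way to score a predicted window against an annotated one; the source does not state that this is the metric the paper uses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DetourAnnotation:
    """One detour sample (V_s, t_s, Q, V_d, T_d). All names are hypothetical."""
    source_video: str                     # V_s: identifier of the source video
    source_time: float                    # t_s: time of the query, in seconds
    query: str                            # Q: natural language detour request
    detour_video: str                     # V_d: identifier of the detour video
    detour_window: Tuple[float, float]    # T_d: assumed (start, end) in seconds

    def __post_init__(self) -> None:
        start, end = self.detour_window
        if start < 0 or end <= start:
            raise ValueError("detour_window must satisfy 0 <= start < end")
        if self.source_time < 0:
            raise ValueError("source_time must be non-negative")


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two time windows given as (start, end)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up example modeled on the peanut butter caption; IDs and times are invented.
sample = DetourAnnotation(
    source_video="source_video_id",
    source_time=42.0,
    query="can I use chunky peanut butter here?",
    detour_video="detour_video_id",
    detour_window=(30.0, 45.0),
)
overlap = temporal_iou((35.0, 50.0), sample.detour_window)
```

Keeping the record immutable and validating the window on construction is a design choice for the sketch; weakly supervised generation can produce malformed outputs, so rejecting them early is one plausible way to filter.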