Predicting Implicit Arguments in Procedural Video Instructions

Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

TL;DR

Implicit-VidSRL introduces a multimodal SRL benchmark for procedural videos that emphasizes implicit arguments, encoding steps as semantic frames {verb, what, where/with}. The authors annotate and leverage a silver-standard SRL dataset to train iSRL-Qwen2-VL, demonstrating significant improvements in implicit-argument prediction and next-step generation over baselines like GPT-4o. Key findings show multimodal context and SRL-informed prompting substantially enhance long-horizon reasoning in cooking procedures, with a practical impact on instruction personalization and human-robot collaboration. This work provides a new dataset, a silver-standard annotation pipeline, and a model that advances fine-grained, context-aware procedural understanding.

Abstract

Procedural texts help AI enhance reasoning about context and action sequences. Transforming these texts into Semantic Role Labeling (SRL) frames improves understanding of individual steps by identifying predicate-argument structure such as {verb, what, where/with}. Procedural instructions are highly elliptic. For instance, given (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step's where argument must be inferred from the context: it refers to the bowl where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models' contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multimodal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% relative improvement for where/with-implicit semantic roles over GPT-4o.
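To make the frame structure concrete, the cucumber and tomato example from the abstract can be sketched as a small data structure. This is only an illustration: the class names, the `implicit` flag, and the masking helper are assumptions made for this sketch and are not the dataset's actual schema or the authors' code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    text: str
    # Hypothetical flag: True when the entity is not stated in the step text
    # and has to be inferred from earlier steps or from the video.
    implicit: bool = False


@dataclass
class Frame:
    verb: str
    what: List[Argument] = field(default_factory=list)
    where_with: List[Argument] = field(default_factory=list)


def implicit_arguments(frame: Frame) -> List[str]:
    """Return the text of every argument marked as implicit in a frame."""
    return [a.text for a in frame.what + frame.where_with if a.implicit]


def mask_arguments(frame: Frame) -> dict:
    """Keep the verb and hide the arguments a model would have to predict."""
    return {"verb": frame.verb, "what": None, "where/with": None}


# (i) add cucumber to the bowl, (ii) add sliced tomatoes
steps = [
    Frame("add", [Argument("cucumber")], [Argument("bowl")]),
    Frame("add", [Argument("sliced tomatoes")], [Argument("bowl", implicit=True)]),
]

for number, frame in enumerate(steps, start=1):
    print(number, mask_arguments(frame), implicit_arguments(frame))
```

In this sketch the second step carries "bowl" as an implicit where/with argument, which mirrors the inference described in the abstract. The masked dictionary reflects the setting in which the verb is given and the remaining roles must be predicted.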

Paper Structure

This paper contains 48 sections, 1 equation, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Implicit-VidSRL: A new semantic role labeling (SRL) based dataset that represents procedural videos using semantic frames ({verb, what, where/with}) with implicit arguments. For instance, step 2 is transformed into steps 2(a) and 2(b), while in step 5 the arguments are implicit and must be inferred from the visual and textual context of steps 2 and 3. The implicit information is emphasized using a background color.
  • Figure 2: The Implicit Argument Prediction task involves providing the input sequence, which may be in the form of text or video or both, to a multimodal large model, alongside masked semantic frames: The arguments that are highlighted with red boxes in the output structure are not provided as part of the input and have to be predicted.
  • Figure 3: Qualitative example using video-only predictions. The example is from TASTY (Sener et al., 2022), ID: https://tasty.co/recipe/cider-pulled-pork. It highlights common errors in the predictions, i.e., a failure to track the mixture ingredients, as in step 3(a), where pork is mixed with spices. Incorrect predictions are highlighted in red and missing ingredients are indicated using '??'.
  • Figure 4: Comparison of GPT-4o and our iSRL-Qwen2-VL model for argument prediction across semantic frame positions in multi-modal procedural inputs (V+T).
  • Figure 5: Annotation Tool. The images in (a) and (b) show the tool interface for annotating the implicit entities during Stage 1, while the image in (c) shows the tool interface for semantic role labeling in Stage 3 (the video is omitted for clarity).
  • ...and 6 more figures