Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals

Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng

TL;DR

The paper tackles the problem of understanding procedural knowledge by sequencing unordered multimodal instructions. It curates two datasets from WikiHow and RecipeQA, with comprehensive human annotations including alternative orders, and proposes sequence-aware pretraining to exploit sequential alignments in text and images. Evaluations with encoder-decoder architectures (e.g., VisualBERT, CLIP-ViL, BERSON) show modest but consistent gains (>5% in perfect match ratio, PMR) from the proposed pretraining, while humans remain substantially better than the models. The work highlights the value of multimodal grounding for procedure understanding and provides datasets and methods to support future research on dependencies and parallelizable task steps in real-world instructions.

Abstract

The ability to sequence unordered events is an essential skill for comprehending and reasoning about real-world task procedures, which often requires a thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such a capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have this capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find that models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in significant improvements of more than 5%.

Paper Structure

This paper contains 41 sections, 1 equation, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Multimodal task procedure sequencing: The left column shows unordered instruction steps from the manual How To Make Wood Signs. Each step is a text description and its associated image. Without the complementary information from the visuals, a novice may have difficulty inferring the proper task order. Considering multimodal information, the proper order can be correctly inferred (right column).
  • Figure 2: Sequence-aware pretraining includes: (1) masked language modeling (MLM), (2) image-swapping prediction (ISP/PISP) which requires the model to predict if some images (image-patches) are swapped, and (3) sequential masked region modeling (SMRM) where models are asked to reconstruct masked regions in each image within the input sequence.
  • Figure 3: Top-3 and least-2 categories of human-model performance difference (in PMR): The selected categories have >10 samples. The difference bars on the multimodal model series are compared against the text-only model series.
  • Figure 4: Qualitative examples: We show some qualitative samples from our dataset together with human and model predictions and the annotated multi-reference ground truths. The texts are truncated to fit into the box shown in each sample. Performance is reported as (single-reference, multi-reference) accuracy.
  • Figure 5: MTurk Annotation User Interface: (a) We ask the annotator to follow the indicated instruction and perform the sequencing task. (b) The annotation task is designed for intuitive drag-and-drop usage, followed by a few additional questions such as confidence level and whether each modality helps. (This example is obtained from the RecipeQA dataset.)
  • ...and 1 more figure
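
The captions above refer to PMR and to (single-reference, multi-reference) accuracy without defining them. The following is a minimal sketch of how such sequencing metrics are commonly computed, not the authors' evaluation code. It assumes PMR counts a prediction as correct only when the whole predicted order matches a reference, that accuracy is the fraction of positions holding the correct step, and that the multi-reference score is the best score over all annotated alternative orders. All function names are hypothetical.

```python
from typing import Callable, Sequence

Order = Sequence[int]  # a permutation of step indices, e.g. [0, 2, 1, 3]


def position_accuracy(pred: Order, ref: Order) -> float:
    """Fraction of positions where the predicted step equals the reference step."""
    if len(pred) != len(ref):
        raise ValueError("prediction and reference must have the same length")
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def perfect_match(pred: Order, ref: Order) -> float:
    """1.0 if the whole predicted order equals the reference, else 0.0."""
    return float(list(pred) == list(ref))


def multi_reference(
    pred: Order, refs: Sequence[Order], metric: Callable[[Order, Order], float]
) -> float:
    """Assumption: score against every valid alternative order and keep the best."""
    return max(metric(pred, ref) for ref in refs)


def corpus_score(
    preds: Sequence[Order],
    refs_per_sample: Sequence[Sequence[Order]],
    metric: Callable[[Order, Order], float],
) -> float:
    """Average the multi-reference score over all samples (PMR when metric=perfect_match)."""
    scores = [multi_reference(p, refs, metric) for p, refs in zip(preds, refs_per_sample)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # One sample with two acceptable orders: steps 1 and 2 are interchangeable.
    references = [[0, 1, 2, 3], [0, 2, 1, 3]]
    prediction = [0, 2, 1, 3]
    print(position_accuracy(prediction, references[0]))            # single-reference accuracy
    print(multi_reference(prediction, references, position_accuracy))  # multi-reference accuracy
    print(multi_reference(prediction, references, perfect_match))      # multi-reference perfect match
```

With a single reference, a prediction that uses a valid alternative order is penalized; scoring against all annotated orders removes that penalty, which is why the two numbers are reported side by side.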