Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng
TL;DR
The paper studies procedural understanding by asking models to sequence unordered multimodal instructions. It introduces two datasets, built from WikiHow and RecipeQA, with comprehensive human annotations that include alternative valid orders, and proposes sequence-aware pretraining that exploits the sequential alignment of text and images. Evaluations with encoder–decoder architectures (e.g., VisualBERT, CLIP-ViL, BERSON) show significant gains of more than 5% in PMR (perfect match ratio) from the proposed pretraining, while humans still perform substantially better. The work highlights the value of multimodal grounding for understanding procedures and provides datasets and methods for future research on step dependencies and parallelizable steps in real-world instructions.
Abstract
The ability to sequence unordered events is an essential skill for comprehending and reasoning about real-world task procedures. It often requires a thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of text and images. Such a capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have this capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find that models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both text and images, yielding significant improvements of more than 5%.
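To make the evaluation concrete, the following is a minimal sketch of how a perfect match ratio (PMR) could be computed when an instance has several acceptable step orders. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `perfect_match_ratio`, the integer step ids, and the rule that a prediction counts as correct if it equals any one of the annotated orders are all assumptions made here.

```python
from typing import List, Sequence


def perfect_match_ratio(
    predictions: Sequence[Sequence[int]],
    references: Sequence[Sequence[Sequence[int]]],
) -> float:
    """Fraction of instances whose predicted order is exactly right.

    predictions[i] is the predicted order of step ids for instance i.
    references[i] is a list of acceptable orders for instance i
    (assumption: the original order plus any annotated alternatives).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        raise ValueError("at least one instance is required")

    exact = 0
    for pred, acceptable in zip(predictions, references):
        # A prediction is a perfect match if it equals any acceptable order.
        if any(list(pred) == list(ref) for ref in acceptable):
            exact += 1
    return exact / len(predictions)


# Hypothetical toy data: three instances with three steps each.
toy_predictions: List[List[int]] = [[0, 1, 2], [1, 0, 2], [2, 1, 0]]
toy_references: List[List[List[int]]] = [
    [[0, 1, 2]],             # single valid order
    [[0, 1, 2], [1, 0, 2]],  # steps 0 and 1 are interchangeable
    [[0, 1, 2]],             # single valid order
]
toy_pmr = perfect_match_ratio(toy_predictions, toy_references)
```

In this toy data the second prediction is credited only because an alternative order is listed, which shows why annotating alternative orders can change the measured score.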
