Visual Goal-Step Inference using wikiHow

Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch

TL;DR

This work introduces Visual Goal-Step Inference (VGSI), a multimodal task in which a model is given a textual goal and must identify the image that depicts a plausible step toward it. Using a large wikiHow-based dataset, the authors show that current state-of-the-art multimodal models struggle with VGSI, yet the visual-language representations learned from this data transfer to other instructional datasets such as COIN and HowTo100M. They formalize several models (DeViSE, Similarity Network, Triplet Network, LXMERT) and evaluate them in a 4-way multiple-choice setup with several negative-sampling strategies, as well as on retrieval tasks. Transfer experiments indicate that wikiHow pretraining provides strong cross-domain benefits, and that a step-aggregation approach can further boost performance. The work positions VGSI as a path toward richer multimodal procedural reasoning for applications in dialogue systems and robotics.

Abstract

Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
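
To make the 4-way multiple-choice format concrete, here is a minimal sketch of how a prediction could be scored once a goal and its four candidate images have been embedded in a shared vector space. It is an illustration, not the authors' implementation. The function names (`cosine`, `choose_step_image`, `accuracy`) are hypothetical, the vectors are hand-made toy values rather than real model outputs, and cosine similarity is only an assumed scoring function.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_step_image(goal_vec: Vector, image_vecs: Sequence[Vector]) -> int:
    """Return the index of the candidate image most similar to the goal embedding."""
    if len(image_vecs) != 4:
        raise ValueError("VGSI multiple choice expects exactly four candidate images")
    scores = [cosine(goal_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Fraction of (goal, candidates, correct_index) examples answered correctly."""
    correct = sum(
        choose_step_image(goal, images) == label for goal, images, label in examples
    )
    return correct / len(examples)


# Toy embeddings (assumed values): the third candidate points in roughly
# the same direction as the goal, so it should be selected.
goal = [1.0, 0.0, 0.5]
candidates = [
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.9, 0.1, 0.6],
    [0.0, 0.0, -1.0],
]
predicted = choose_step_image(goal, candidates)
```

In this toy case `predicted` is the index of the third candidate. With a real model, the embeddings would come from trained text and image encoders, and the three incorrect candidates would be drawn by one of the negative-sampling strategies.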

Paper Structure

This paper contains 33 sections, 14 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: An example Visual Goal-Step Inference task: given a text goal (bake fish), select the image (C) that represents a step towards that goal.
  • Figure 2: Hierarchical multimodality of wikiHow.
  • Figure 3: Accuracy of humans (circles) and models (triangles) on the modified wikiHow VGSI test set with different textual input (e.g., in Fig. 1, the goal prompt is replaced by the method, "Baking the Fish", or the step, "Preheat the oven").
  • Figure 4: Few-shot performance on COIN (similarity sampling) with different pre-training datasets vs. the number of examples per goal.
  • Figure 5: Transfer performance on HowTo100M (similarity sampling) with different pre-training datasets vs. the number of training examples.
  • ...and 4 more figures