Visual Goal-Step Inference using wikiHow
Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch
TL;DR
This work introduces Visual Goal-Step Inference (VGSI), a multimodal task in which a model must decide which images depict plausible steps toward a textual goal. Using a large wikiHow-based dataset, the authors show that current state-of-the-art multimodal models struggle with VGSI, yet the visual-language representations learned from it transfer to other instructional datasets such as COIN and HowTo100M. They formalize several models (DeViSE, Similarity Network, Triplet Network, LXMERT) and evaluate them in a 4-way multiple-choice setup with several negative-sampling strategies, plus retrieval tasks. Transfer experiments show that wikiHow pretraining provides strong cross-domain benefits, and that a step-aggregation approach can further boost performance. The work presents VGSI as a path toward richer multimodal procedural reasoning for applications in dialogue systems and robotics.
Abstract
Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
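
To make the 4-way multiple-choice format concrete, the following is a minimal sketch of how one example could be scored and how accuracy could be computed. It assumes the goal text and the four candidate images have already been embedded into a shared vector space, and it uses cosine similarity as the matching score; both are illustrative assumptions, not the scoring function of any particular model in the paper. The function names (cosine, choose_step, accuracy) and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One example: (goal embedding, four candidate image embeddings, index of the correct image).
Example = Tuple[Vector, List[Vector], int]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_step(goal_vec: Vector, image_vecs: List[Vector]) -> int:
    """Return the index of the candidate image most similar to the goal."""
    scores = [cosine(goal_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Example]) -> float:
    """Fraction of examples where the top-scoring image is the correct step."""
    correct = sum(choose_step(goal, imgs) == label for goal, imgs, label in examples)
    return correct / len(examples)


# Toy 2-D embeddings (made up for illustration; real embeddings would come from
# a trained text encoder and image encoder).
toy_examples: List[Example] = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9]], 0),
    ([0.0, 1.0], [[1.0, 0.0], [0.7, 0.7], [0.2, 0.9], [-0.5, 0.1]], 2),
    # Here a distractor (index 3) matches the goal better than the labeled step.
    ([1.0, 1.0], [[1.0, 0.0], [0.5, 0.6], [0.0, -1.0], [1.0, 1.0]], 1),
]

toy_accuracy = accuracy(toy_examples)
```

With random guessing, accuracy in this setup would be 25% in expectation, which is the natural baseline against which any model's multiple-choice accuracy can be read. How hard the task is also depends on how the three distractor images are sampled.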
