Show and Guide: Instructional-Plan Grounded Vision and Language Model

Diogo Glória-Silva, David Semedo, João Magalhães

TL;DR

This paper presents MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. It also shows that the model learns cross-modal temporal and plan-structure representations that align textual plan steps with instructional video moments.

Abstract

Guiding users through complex procedural plans is an inherently multimodal task in which visually illustrated plan steps are crucial to delivering effective plan guidance. However, existing work on plan-following language models (LMs) often does not support multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to the semantic layers of multimodal instructional plans, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.

Paper Structure

This paper contains 44 sections, 3 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Example of a plan-grounded multimodal dialogue. The proposed model has the ability to understand and respond to multimodal input, provide relevant information from multiple knowledge sources, and guide the user through a complex task while adhering to a structured plan.
  • Figure 2: Comprehensive illustration of the MM-PlanLLM architecture, including the 3 training stages employed for model training. * denotes that the [RET] token embedding representations and the Language Modeling Head of the LLM remain trainable.
  • Figure 3: Text-query to visual plan alignment. MM-PlanLLM effectively learns to align textual [RET] token representations with those of the target step frames. We remove outliers for clarity.
  • Figure 4: Image-query to text plan alignment. Most similar plan step to the provided visual input, as measured by BS using the generated answer.
  • Figure 5: Average similarity of each frame against all other frames from the same video. It shows a clear bidirectional 3-frame window of higher similarity.
  • ...and 2 more figures