VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva; David Semedo; João Magalhães

TL;DR

This work introduces VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans, and shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting.

Abstract

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were conducted on a novel dataset of rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

Paper Structure (44 sections, 6 equations, 5 figures, 23 tables, 1 algorithm)

Introduction
Related Work
Definitions and Problem Formulation
Methodology
Multimodal Plan Guidance LVLMs
Plan-grounded reasoning capabilities
Plan-Grounded Answer Generation.
Plan-aware Visual Question Answering (pVQA).
Plan-step retrieval capabilities
Visually-Informed Step Generation (VSG).
Conversational Video Moment Retrieval.
Model Architecture
Training
InstructionVidDial Dataset
Experimental Setup
...and 29 more sections

Figures (5)

Figure 1: VIGiA is grounded on complex multimodal instructional plans, delivering unified multimodal alignment over dialogue turns by providing text-based guidance (Goal 1), plan-aware visual question answering (Goal 2), aligning plan actions and visual context (Goal 3), and retrieving relevant video moments (Goal 4).
Figure 2: VIGiA is an LVLM that processes instructional video plans to navigate through the steps of a plan and to perform QA, VQA, and retrieval of arbitrary text steps or video moment steps.
Figure 3: Global view of VIGiA's architecture. To handle multimodal inputs, VIGiA combines a visual encoder and an LLM using an MLP as a connector module. For conversational video moment retrieval, VIGiA outputs dedicated video moment start and end representations that can be used for start and end frame retrieval.
Figure 4: Comparing R@1 performance with varying values for the similarity threshold on InstructionVidDial's dev set.
Figure 5: Six examples of how the similarity of the start and end retrieval representations with the video frames varies throughout the video.
