VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Diogo Glória-Silva, David Semedo, João Magalhães

TL;DR

This work introduces VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans, and shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting.

Abstract

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. We ran experiments on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

Paper Structure

This paper contains 44 sections, 6 equations, 5 figures, 23 tables, 1 algorithm.

Figures (5)

  • Figure 1: VIGiA is grounded on complex multimodal instructional plans, delivering unified multimodal alignment over dialogue turns by providing text-based guidance (Goal 1), answering plan-aware visual questions (Goal 2), aligning plan actions with visual context (Goal 3), and retrieving relevant video moments (Goal 4).
  • Figure 2: VIGiA is an LVLM that processes instructional video plans to navigate through the steps of the plan and to perform QA, VQA, and retrieval of arbitrary text steps or video moment steps.
  • Figure 3: Global view of VIGiA's architecture. To handle multimodal inputs, VIGiA combines a visual encoder and an LLM, using an MLP as a connector module. For conversational video moment retrieval, VIGiA outputs dedicated video moment start and end representations that can be used for start and end frame retrieval.
  • Figure 4: Comparing R@1 performance with varying values for the similarity threshold on InstructionVidDial's dev set.
  • Figure 5: Six examples of how the similarity of the start and end retrieval representations with the video frames varies throughout the video.
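
The captions for Figures 3 to 5 describe retrieving a video moment by comparing start and end representations against the video frames, with a similarity threshold. The sketch below is one possible reading of that idea, not the authors' implementation. The function names, the use of cosine similarity, the constraint that the end frame does not precede the start frame, and the way the threshold rejects weak matches are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def retrieve_moment(
    start_repr: Sequence[float],
    end_repr: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical default, not taken from the paper
) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame) indices, or None if no frame is similar enough.

    Assumption: the start frame is the frame most similar to start_repr, and
    the end frame is the most similar frame to end_repr at or after the start.
    """
    if not frame_embs:
        return None

    start_sims = [cosine(start_repr, f) for f in frame_embs]
    end_sims = [cosine(end_repr, f) for f in frame_embs]

    # Best-matching start frame over the whole video.
    start_idx = max(range(len(frame_embs)), key=start_sims.__getitem__)
    # Best-matching end frame, restricted to frames from start_idx onward.
    end_idx = max(range(start_idx, len(frame_embs)), key=end_sims.__getitem__)

    # Assumed use of the threshold: reject the moment when either match is weak.
    if start_sims[start_idx] < threshold or end_sims[end_idx] < threshold:
        return None
    return start_idx, end_idx
```

Under these assumptions, raising `threshold` makes the function return `None` more often, which is the kind of trade-off a threshold sweep such as the one in Figure 4 would explore.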