VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Muhammet Furkan Ilaslan; Ali Koksal; Kevin Qinhong Lin; Burak Satar; Mike Zheng Shou; Qianli Xu

TL;DR

VG-TVP presents a zero-shot, multimodal procedural planning framework that jointly generates visually grounded textual plans and video plans from high-level goals. By integrating vanilla text planning, a Fusion of Captioning (FoC) module, and two bridges (V2T-B and T2V-B), the method achieves temporally coherent and accurate multimodal procedures, even for unseen tasks lacking instructional videos. A new Daily-PP dataset supports rigorous evaluation across seen and unseen tasks, showing VG-TVP often outperforms baselines and a TIP reference in textual informativeness, visual informativeness, and overall plan coherence. The work highlights the value of aligning video captions with text plans and demonstrates practical benefits for human learning tasks, with potential for broader multimodal instruction and education applications.

Abstract

Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions that combine text and video to assist users remains under-explored. To address this gap, we propose Visually Grounded Text-Video Prompting (VG-TVP), a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. Given a specified high-level objective, it generates cohesive text and video procedural plans. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in the procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of video captioning models, and the text-to-video generation ability of diffusion models. It improves the interaction between modalities through a novel Fusion of Captioning (FoC) method, a Text-to-Video Bridge (T2V-B), and a Video-to-Text Bridge (V2T-B). Together, these components allow LLMs to guide the generation of visually grounded text plans and textually grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences regarding textual and visual informativeness, temporal coherence, and plan accuracy. Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.
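The data flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the function names, the prompt prefixes, the ordering of the stages, and the use of plain callables for the LLM, the captioning models, and the text-to-video model are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MultimodalPlan:
    goal: str
    text_steps: List[str]      # visually grounded text plan
    video_prompts: List[str]   # per-step prompts for the video model
    video_clips: List[str]     # identifiers returned by the video model


def split_steps(text: str) -> List[str]:
    """Split an LLM response into non-empty, stripped lines (one step each)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def make_plan(
    goal: str,
    llm: Callable[[str], str],                    # hypothetical: prompt -> text
    captioners: Sequence[Callable[[str], str]],   # hypothetical: video -> caption
    text_to_video: Callable[[str], str],          # hypothetical: prompt -> clip id
    reference_videos: Sequence[str] = (),
) -> MultimodalPlan:
    # Stage 1: zero-shot text plan from the high-level goal.
    draft = split_steps(llm(f"PLAN: {goal}"))

    # Stage 2 (stand-in for FoC): caption each reference video with every
    # captioner, then ask the LLM to fuse the captions into one description.
    captions = [cap(video) for video in reference_videos for cap in captioners]

    if captions:
        fused = llm("FUSE: " + " | ".join(captions))
        # Stage 3 (stand-in for V2T-B): revise the draft using the fused captions.
        text_steps = split_steps(llm("GROUND: " + fused + "\n" + "\n".join(draft)))
    else:
        # Unseen task with no instructional video: keep the zero-shot draft.
        text_steps = draft

    # Stage 4 (stand-in for T2V-B): turn each step into a video prompt,
    # then pass each prompt to the text-to-video model.
    video_prompts = [llm(f"VIDEO_PROMPT: {step}") for step in text_steps]
    video_clips = [text_to_video(prompt) for prompt in video_prompts]

    return MultimodalPlan(goal, text_steps, video_prompts, video_clips)
```

Passing the models in as callables keeps the sketch independent of any particular LLM, captioning, or diffusion library, and makes the fallback for tasks without instructional videos explicit.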
