VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinhong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu

TL;DR

VG-TVP presents a zero-shot, multimodal procedural planning framework that jointly generates visually grounded textual plans and video plans from high-level goals. By integrating vanilla text planning, a Fusion of Captioning (FoC) module, and two bridges (V2T-B and T2V-B), the method achieves temporally coherent and accurate multimodal procedures, even for unseen tasks lacking instructional videos. A new Daily-PP dataset supports rigorous evaluation across seen and unseen tasks, showing VG-TVP often outperforms baselines and a TIP reference in textual informativeness, visual informativeness, and overall plan coherence. The work highlights the value of aligning video captions with text plans and demonstrates practical benefits for human learning tasks, with potential for broader multimodal instruction and education applications.

Abstract

Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method, a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using a Text-to-Video Bridge (T2V-B) and a Video-to-Text Bridge (V2T-B). These bridges allow LLMs to guide the generation of visually grounded text plans and textually grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.
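
The data flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the function `plan`, the helper `split_steps`, the `MultimodalPlan` container, the prompt wording, and the three callables (`llm`, `captioner`, `text_to_video`) are all hypothetical names introduced here. The sketch also assumes that, when no instructional videos are available, the vanilla text plan is used unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MultimodalPlan:
    text_steps: List[str]   # text plan, one entry per step
    video_clips: List[str]  # one generated clip (here an opaque handle) per step


def split_steps(llm_output: str) -> List[str]:
    """Split an LLM response into non-empty, stripped lines (one per step)."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]


def plan(
    goal: str,
    videos: List[str],
    llm: Callable[[str], str],            # hypothetical: prompt -> text
    captioner: Callable[[str], str],      # hypothetical: video -> caption
    text_to_video: Callable[[str], str],  # hypothetical: step text -> clip
) -> MultimodalPlan:
    # Vanilla text plan from the LLM alone (zero-shot).
    vanilla = llm(f"List the steps needed to: {goal}")

    # Video-to-text direction: caption each instructional video.
    captions = [captioner(video) for video in videos]

    if captions:
        # Fusion of Captioning (assumed form): merge all captions into one
        # description, then let the LLM revise the vanilla plan with it.
        fused = llm(
            "Merge these video captions into one ordered description:\n"
            + "\n".join(captions)
        )
        grounded = llm(
            "Revise the plan using the visual description.\n"
            f"Plan:\n{vanilla}\nDescription:\n{fused}"
        )
    else:
        # Assumption: with no videos, keep the vanilla plan as-is.
        grounded = vanilla

    # Text-to-video direction: each text step prompts the video generator.
    steps = split_steps(grounded)
    clips = [text_to_video(step) for step in steps]
    return MultimodalPlan(text_steps=steps, video_clips=clips)
```

Passing the models in as callables keeps the sketch independent of any particular LLM, captioning model, or diffusion model, so each component can be replaced by a stub when experimenting with the control flow.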

Paper Structure

This paper contains 34 sections, 24 figures, and 7 tables.

Figures (24)

  • Figure 1: VG-TVP generates MPP with multiple steps for a high-level goal, supplying textual and visual guidelines.
  • Figure 2: VG-TVP Model: Given the textual input and multiple instructional videos, VG-TVP generates visually grounded textual plans and video plans by using V2T-B and T2V-B. ChatGPT 3.5 is used to reorganize all captions to generate FoC.
  • Figure 3: FoC captures and fuses the captions of instructional videos (IVs), then injects them into the system by aligning them with the vanilla textual plan.
  • Figure 4: Impact of FoC on the task, "How to Fold the Presidential Pocket Square?".
  • Figure 5: Qualitative comparison between the Llama2-13B-q8 model and VG-T2V (Ours). Visuals (orange) are used to generate video plans. VG-TVP can increase the number of steps to make the MPP more informative and accurate.
  • ...and 19 more figures