Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, Xitong Yang

TL;DR

VidAssist is an integrated framework for zero/few-shot goal-oriented planning in instructional videos. It uses large language models as both the knowledge base and the assessment tool for generating and evaluating action plans, thus overcoming the challenges of acquiring procedural knowledge from small-scale, low-diversity datasets.

Abstract

Goal-oriented planning, or anticipating a series of actions that transition an agent from its current state to a predefined objective, is crucial for developing intelligent assistants aiding users in daily procedural tasks. The problem presents significant challenges due to the need for comprehensive knowledge of temporal and hierarchical task structures, as well as strong capabilities in reasoning and planning. To achieve this, prior work typically relies on extensive training on the target dataset, which often results in significant dataset bias and a lack of generalization to unseen tasks. In this work, we introduce VidAssist, an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos. VidAssist leverages large language models (LLMs) as both the knowledge base and the assessment tool for generating and evaluating action plans, thus overcoming the challenges of acquiring procedural knowledge from small-scale, low-diversity datasets. Moreover, VidAssist employs a breadth-first search algorithm for optimal plan generation, in which a composite of value functions designed for goal-oriented planning is utilized to assess the predicted actions at each step. Extensive experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups, e.g., visual planning for assistance (VPA) and procedural planning (PP), and achieves remarkable performance in zero-shot and few-shot setups. Specifically, our few-shot model outperforms the prior fully supervised state-of-the-art method by +7.7% on the VPA task and +4.81% on the PP task on the COIN dataset when predicting 4 future actions. Code and models are publicly available at https://sites.google.com/view/vidassist.

Paper Structure (37 sections, 6 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Goal-oriented planning aims to generate action plans to achieve a given goal based on the visual observations. The task unifies two setups that have been independently explored in prior literature: Visual Planning for Assistance [patel2023pretrained] (left) and Procedural Planning [chang2020procedure, bi2021procedure, sun2022plate, zhao2022p3iv, wang2023event, liu2023language] (right).
  • Figure 2: Overview of the VidAssist framework. We first process visual observations and goals by transforming the visual inputs into textual descriptions using visual understanding models. Then, we leverage a search-based approach for optimal plan generation: at each step, we propose $K$ probable subsequent actions and assess them using a composite of value functions specifically designed for goal-oriented planning. LLMs are employed as both the knowledge base and the assessment tool in the search process, and we illustrate the details in Fig. 3.
  • Figure 3: Templates and examples of the LLM prompts we design for subsequent action proposal (left) and partial plan evaluation (right).
  • Figure S1: Example of visual planning for assistance on the COIN dataset. (a) VidAssist successfully predicts the future action steps while the LLM baseline fails. (b) Visualization of the proposed search technique with intermediate steps and value scores. We only show three generated actions at each step for brevity and clarity.
  • Figure S2: Example of visual planning for assistance on the CrossTask dataset. (a) VidAssist successfully predicts the future action steps while the LLM baseline fails. (b) Visualization of the proposed search technique with intermediate steps and value scores. We only show three generated actions at each step for brevity and clarity.
  • ...and 4 more figures
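
The propose, assess, search loop described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the names `plan_search`, `propose`, `value_fns`, and `beam_width` are hypothetical, the toy proposer and value functions stand in for the LLM calls, and pruning to a fixed number of partial plans per level is an assumption made to keep the search small. The sketch only shows the general shape of a breadth-first search in which each step proposes up to K next actions and scores the extended partial plans with a sum of value functions.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[str, ...]


def plan_search(
    propose: Callable[[Plan], Sequence[str]],
    value_fns: Sequence[Callable[[Plan], float]],
    horizon: int,
    k: int = 3,
    beam_width: int = 2,
) -> Plan:
    """Level-by-level search over partial plans (hypothetical helper).

    propose    -- returns candidate next actions for a partial plan
                  (an LLM call in the paper; a stub here)
    value_fns  -- scoring functions whose sum ranks a partial plan
                  (the paper's actual value functions are not given here)
    beam_width -- partial plans kept per level (an assumption of this sketch)
    """
    frontier: List[Plan] = [()]
    for _ in range(horizon):
        candidates: List[Tuple[float, Plan]] = []
        for partial in frontier:
            # Propose: take at most k candidate next actions.
            for action in list(propose(partial))[:k]:
                extended = partial + (action,)
                # Assess: composite score is the sum of all value functions.
                score = sum(fn(extended) for fn in value_fns)
                candidates.append((score, extended))
        if not candidates:
            break
        # Search: keep the highest-scoring partial plans for the next level.
        candidates.sort(key=lambda item: item[0], reverse=True)
        frontier = [plan for _, plan in candidates[:beam_width]]
    return frontier[0]


# Toy stand-ins for the LLM proposer and value functions (illustrative only).
RECIPE: Plan = ("crack eggs", "whisk eggs", "heat pan", "pour eggs")


def toy_propose(partial: Plan) -> List[str]:
    # Offer every recipe step that the partial plan has not used yet.
    return [a for a in RECIPE if a not in partial]


def order_value(partial: Plan) -> float:
    # Reward each action that sits at its position in the reference recipe.
    return sum(1.0 for i, a in enumerate(partial) if RECIPE[i] == a)


def no_repeat_value(partial: Plan) -> float:
    # Penalize plans that repeat an action.
    return 0.0 if len(set(partial)) == len(partial) else -1.0


best = plan_search(toy_propose, [order_value, no_repeat_value], horizon=4)
print(best)
```

In a real system, `propose` and the value functions would wrap prompts like those in Figure 3; here they are deterministic stubs so the control flow can be run and inspected without a model.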