Cooking Task Planning using LLM and Verified by Graph Network

Ryunosuke Takebayashi, Vitor Hideyo Isume, Takuya Kiyokawa, Weiwei Wan, Kensuke Harada

TL;DR

The paper tackles the challenge of converting cooking videos into robot-friendly task plans by combining a multimodal LLM with the FOON graph-based reasoning framework. The method translates subtitled video scenes into target object states, then derives and validates a sequence of actions via FOON functional units, with iterative replanning to fix inconsistencies. Motion planning is performed separately to ensure feasible grasps and trajectories, enabling real-world execution on a dual-armed robot; results show that 4 of 5 full task graphs succeed and that target object state estimation reaches 86% accuracy. The approach reduces hallucination risks in LLM outputs through structural validation and yields environment-independent planning suitable for complex, long-horizon cooking tasks.

Abstract

Cooking tasks remain a challenging problem for robotics due to their complexity. Videos of people cooking are a valuable source of information for such tasks, but they introduce a lot of variability in how this data can be translated to a robotic environment. This research aims to streamline this process, focusing on the task plan generation step, by using a Large Language Model (LLM)-based Task and Motion Planning (TAMP) framework to autonomously generate cooking task plans from videos with subtitles and execute them. Conventional LLM-based task planning methods are not well suited to interpreting cooking video data because of uncertainty in the videos and the risk of hallucination in their output. To address both of these problems, we explore using LLMs in combination with Functional Object-Oriented Networks (FOON) to validate the plan and provide feedback in case of failure. This combination can generate task sequences with manipulation motions that are logically correct and executable by a robot. We compare the execution of the plans generated by our approach for 5 cooking recipes against the plans generated by a few-shot LLM-only approach on a dual-arm robot setup. The robot successfully executed 4 of the plans generated by our approach, whereas only 1 of the plans generated solely by the LLM could be executed.
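
The validate-and-give-feedback idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `FunctionalUnit`, `validate_plan`, `plan_with_feedback`, and the stub proposer are hypothetical names chosen for illustration, and a functional unit is reduced here to an action with required and resulting object states. The LLM is replaced by a plain function so the example runs on its own.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-in for a FOON functional unit: an action
# with the object states it requires and the object states it produces.
@dataclass(frozen=True)
class FunctionalUnit:
    action: str
    preconditions: frozenset  # set of (object, state) pairs
    effects: frozenset        # set of (object, state) pairs


def validate_plan(plan, initial_state):
    """Return (True, None) if every unit's preconditions hold in order,
    otherwise (False, message) describing the first failing step."""
    state = dict(initial_state)
    for step, unit in enumerate(plan):
        # Sort so the reported failure is deterministic.
        for obj, required in sorted(unit.preconditions):
            if state.get(obj) != required:
                return False, (
                    f"step {step} ({unit.action}): {obj} must be "
                    f"{required!r}, but is {state.get(obj)!r}"
                )
        for obj, new_state in unit.effects:
            state[obj] = new_state
    return True, None


def plan_with_feedback(propose, initial_state, max_attempts=3):
    """Call propose(feedback) until a plan validates or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        plan = propose(feedback)
        ok, feedback = validate_plan(plan, initial_state)
        if ok:
            return plan
    return None


pick = FunctionalUnit(
    "pick",
    frozenset({("knife", "on_table")}),
    frozenset({("knife", "in_hand")}),
)
slice_ = FunctionalUnit(
    "slice",
    frozenset({("knife", "in_hand"), ("tomato", "whole")}),
    frozenset({("tomato", "sliced")}),
)


def stub_llm(feedback):
    # Stands in for the LLM: the first answer omits the pick step,
    # the answer after receiving feedback includes it.
    return [slice_] if feedback is None else [pick, slice_]


initial = {"knife": "on_table", "tomato": "whole"}
plan = plan_with_feedback(stub_llm, initial)
print([unit.action for unit in plan])
```

In this toy run the first proposed plan is rejected because the knife is not yet in hand, the failure message is passed back to the proposer, and the second plan passes validation. The real framework operates on a richer graph structure and a real LLM; the sketch only shows the shape of the loop.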

Paper Structure

This paper contains 24 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Our proposed framework to generate a cooking task plan for a robot from a video, using LLMs for reasoning and a task graph structure for logical validation.
  • Figure 2: Overview of our proposed method.
  • Figure 3: Example of a functional unit for a "Pick" motion.
  • Figure 4: Additional types of Object nodes proposed for this task.
  • Figure 5: Functional unit with variables (Pick). The fields marked with a question mark are filled with the values specified for the variables when the unit is instantiated.
  • ...and 5 more figures