LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application

Zhe Huang, John Pohovey, Ananya Yammanuru, Katherine Driggs-Campbell

TL;DR

This work proposes Language-driven Intention Tracking (LIT), leveraging LLMs and VLMs to model the human user's long-term behavior and to predict the next human intention to guide the robot for proactive collaboration.

Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) enable robots to ground natural language prompts into control actions to achieve tasks in an open world. However, when applied to a long-horizon collaborative task, this formulation requires excessive prompting to initiate or clarify robot actions at every step of the task. We propose Language-driven Intention Tracking (LIT), leveraging LLMs and VLMs to model the human user's long-term behavior and to predict the next human intention to guide the robot for proactive collaboration. We demonstrate smooth coordination between a LIT-based collaborative robot and the human user in collaborative cooking tasks.

Paper Structure (17 sections, 1 equation, 3 figures)

Figures (3)

  • Figure 1: Language-driven Intention Tracking (LIT) based collaborative robot framework. The open scene understanding module detects objects in the scene and generates potential manipulation options, which in our case are top-down grasp poses. The task graph reasoning module takes the user's prompt on the overall task and the detected objects as input to generate a list of task steps, which we define as intentions in this work. As some steps of the overall task can switch order without affecting the outcome, the LLM checks the reversibility of sequences of task steps and builds a task graph. The Language-driven Intention Tracking module uses the task graph to build the probabilistic graphical model for intention transition. The VLM generates text descriptions from frames, which serve as measurements. To track the human intention, we compute time-varying transition probabilities for the prediction steps, and use the measurements to compute the measurement likelihood for the update steps. The intention-grounded planning module makes an additional prediction step on the current intention posterior, and manipulates the objects relevant to the predicted next intention to proactively collaborate with the human.
  • Figure 2: The graphical model for intention tracking. We denote the measurement of human behavior as $X_t$, and the human intention as $G_t$.
  • Figure 3: Language-driven Intention Tracking with different similarity metrics. The ground-truth order of the human intentions is: slice tomatoes; slice cucumbers; put tomatoes and cucumbers in a bowl; put salad dressing on tomatoes and cucumbers; stir and mix the salad with a spoon. Snapshots show the moments when intention transitions happen. (a) The human starts cutting a cucumber after finishing cutting a tomato. (b) The human starts putting vegetables into a bowl after cutting the cucumber.
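
The predict/update loop described in the Figure 1 caption can be sketched as a discrete Bayes filter over the intentions listed in Figure 3. This is a minimal illustration, not the paper's implementation: the task graph edges in `SUCCESSORS`, the `stay_probability` schedule, the Jaccard token-overlap similarity, the softmax temperature, and the way the next intention is chosen are all assumptions made for the example. Plain strings stand in for the VLM-generated frame descriptions.

```python
import math

INTENTIONS = [
    "slice tomatoes",
    "slice cucumbers",
    "put tomatoes and cucumbers in a bowl",
    "put salad dressing on tomatoes and cucumbers",
    "stir and mix the salad with a spoon",
]

# Hypothetical task graph: SUCCESSORS[i] lists steps that may follow step i.
# Steps 0 and 1 are assumed to be order-swappable. Completed steps are not
# remembered, which is a simplification of a real task graph.
SUCCESSORS = {0: [1, 2], 1: [0, 2], 2: [3], 3: [4], 4: []}


def stay_probability(steps_in_current, base=0.9, decay=0.05, floor=0.5):
    """Assumed schedule: the longer an intention lasts, the likelier a switch."""
    return max(floor, base - decay * steps_in_current)


def transition_matrix(stay_prob):
    """Row-stochastic matrix T[i][j] = P(G_t = j | G_{t-1} = i)."""
    n = len(INTENTIONS)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        successors = SUCCESSORS[i]
        if not successors:  # terminal step absorbs all probability mass
            T[i][i] = 1.0
            continue
        T[i][i] = stay_prob
        for j in successors:
            T[i][j] = (1.0 - stay_prob) / len(successors)
    return T


def predict(belief, T):
    """Prediction step: push the belief through the transition model."""
    n = len(belief)
    return [sum(belief[i] * T[i][j] for i in range(n)) for j in range(n)]


def jaccard(a, b):
    """Token-overlap similarity, a stand-in for a learned text similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def likelihood(caption, temperature=0.1):
    """Measurement likelihood as a softmax over caption/intention similarity."""
    scores = [math.exp(jaccard(caption, g) / temperature) for g in INTENTIONS]
    total = sum(scores)
    return [s / total for s in scores]


def update(prior, lik):
    """Update step: reweight the predicted belief by the likelihood."""
    unnorm = [p * l for p, l in zip(prior, lik)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def track(captions):
    """Run the filter; return the most likely intention per frame and the final belief."""
    n = len(INTENTIONS)
    belief = [1.0 / n] * n
    map_history, steps_in_current = [], 0
    for caption in captions:
        T = transition_matrix(stay_probability(steps_in_current))
        belief = update(predict(belief, T), likelihood(caption))
        best = max(range(n), key=belief.__getitem__)
        same = bool(map_history) and best == map_history[-1]
        steps_in_current = steps_in_current + 1 if same else 0
        map_history.append(best)
    return map_history, belief


def predict_next_intention(belief):
    """Extra prediction step with no self-transition, used to pick the next step."""
    moved = predict(belief, transition_matrix(0.0))
    return max(range(len(moved)), key=moved.__getitem__)
```

A planner could call `track` on the captions seen so far and pass the resulting belief to `predict_next_intention` to decide which objects to prepare. Replacing `jaccard` with an embedding-based similarity would change only the `likelihood` function.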