VLM-driven Behavior Tree for Context-aware Task Planning

Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Kazuhiro Sasabuchi, Katsushi Ikeuchi

TL;DR

This work tackles the challenge of rapidly programming context-aware robot behavior in visually diverse environments. It introduces a framework that uses Vision-Language Models (VLMs) to generate Behavior Trees (BTs) with self-prompted visual conditions, so that the robot can make runtime decisions from real-time imagery, and pairs this with an interactive BT editor for safety and transparency. The approach is validated through a real-world cafe demonstration and an end-to-end robot experiment, which show that task execution is feasible and that the framework can generate diverse BTs across cafe scenarios. The project contributes an end-to-end prompt-based BT generation pipeline, an interactive visualization/editing interface, and publicly available code guidelines, with the aim of making robot programming accessible to domain experts without deep coding expertise.

Abstract

The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
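
The self-prompted visual condition described above can be illustrated with a minimal sketch. This is not the authors' implementation: the node classes, the prompt wording in `build_prompt`, the skill names `pick_up_cup` and `return_to_counter`, and the `stub_evaluator` function are all hypothetical. In particular, `stub_evaluator` stands in for the second VLM process and simply treats the image bytes as a caption, so the example runs without any model access.

```python
from dataclasses import dataclass
from typing import Callable, List

SUCCESS, FAILURE = "SUCCESS", "FAILURE"

# Hypothetical evaluator type: (prompt, image) -> True if the condition holds.
Evaluator = Callable[[str, bytes], bool]


def build_prompt(condition_text: str) -> str:
    """Embed the free-form condition text in a yes/no question (illustrative wording)."""
    return f"Answer yes or no. Is this true in the image? {condition_text}"


@dataclass
class VisualCondition:
    text: str            # free-form condition text written into the BT
    evaluator: Evaluator  # stand-in for the VLM that checks the image

    def tick(self, image: bytes) -> str:
        holds = self.evaluator(build_prompt(self.text), image)
        return SUCCESS if holds else FAILURE


@dataclass
class Action:
    name: str
    log: List[str]

    def tick(self, image: bytes) -> str:
        self.log.append(self.name)  # stand-in for executing a robot skill
        return SUCCESS


@dataclass
class Sequence:
    children: list

    def tick(self, image: bytes) -> str:
        # Succeeds only if every child succeeds, in order.
        for child in self.children:
            if child.tick(image) == FAILURE:
                return FAILURE
        return SUCCESS


@dataclass
class Fallback:
    children: list

    def tick(self, image: bytes) -> str:
        # Tries children in order and stops at the first success.
        for child in self.children:
            if child.tick(image) == SUCCESS:
                return SUCCESS
        return FAILURE


def stub_evaluator(prompt: str, image: bytes) -> bool:
    # Stub only: treats the image bytes as a caption and looks for the word "cup".
    # A real system would send the prompt and the camera image to a VLM instead.
    return b"cup" in image


def run(image: bytes) -> List[str]:
    """Tick a small hypothetical cafe BT once and return the executed skills."""
    log: List[str] = []
    tree = Fallback([
        Sequence([
            VisualCondition("A used cup is left on the table", stub_evaluator),
            Action("pick_up_cup", log),
        ]),
        Action("return_to_counter", log),
    ])
    tree.tick(image)
    return log
```

Calling `run(b"table with a cup")` takes the branch guarded by the visual condition, while `run(b"empty table")` falls through to the alternative action. Keeping the condition as plain text inside the tree is what allows a person to read and edit it in a BT editor before execution.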

Paper Structure (18 sections, 11 figures, 5 tables)

Figures (11)

  • Figure 1: We propose a user-friendly robot system that enables domain experts to program robots interactively. The system takes user instructions, scene details, map information, and the robot's skill set as inputs, and converts them into a visual program represented as a Behavior Tree (BT). The BT incorporates visual condition nodes that dynamically switch the robot's behavior based on real-world images during execution.
  • Figure 2: An example of a role prompt.
  • Figure 3: An example of an environment prompt.
  • Figure 4: An example of an output prompt.
  • Figure 5: An example of an action prompt.
  • ...and 6 more figures