Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning

Cristiano Battistini, Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

TL;DR

This work deploys a compact, open-source multimodal model to generate behavior trees for robotic task planning. Because no existing dataset links visual observations and instructions to executable behavior trees, it also proposes a method to construct one from existing robotic episodes, in which a large model serves as a teacher in a multi-stage generation pipeline.

Abstract

Large and small language models have been widely used for robotic task planning. At the same time, vision-language models (VLMs) have successfully tackled problems such as image captioning, scene understanding, and visual question answering. In this work, we combine these two approaches by deploying a compact, open-source multimodal model to generate behavior trees for robotic task planning. The main obstacle to achieving this goal is the lack of an existing dataset that links visual observations and instructions to executable behavior trees. We propose a method to construct such a dataset starting from existing robotic episodes (i.e., Open X-Embodiment), in which a large model serves as a teacher in a multi-stage generation pipeline. We use this dataset to fine-tune VLMs ranging from 500M to 4B parameters via parameter-efficient fine-tuning (PEFT). The generated behavior trees, compatible with the BehaviorTree.CPP library, are evaluated both offline, using structural and lexical metrics, and online through the execution of household tasks in a state-of-the-art embodied simulator. Our results demonstrate that our fine-tuned 4B-parameter VLM approaches the performance of state-of-the-art closed-source models, achieving an 87% success rate while requiring only a fraction of the computational resources.

Paper Structure (12 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of our pipeline. We derive image-instruction pairs from Open X-Embodiment and use a multi-stage teacher pipeline to generate a multimodal behavior tree dataset. Compact VLMs are then fine-tuned with QLoRA to output BTs compatible with BehaviorTree.CPP, given a single RGB image and a natural language instruction. Finally, the generated BTs are executed in OmniGibson on the BEHAVIOR-1K household tasks to assess planning success.
  • Figure 2: BT generation methods by input modality and model scale. The shaded quadrant marks the research gap addressed by this work.
  • Figure 3: Dataset generation pipeline. Starting from an Open X-Embodiment subset (1,622 episodes), we build a $3{\times}3$ frame sheet as a temporally sparse episode summary and use GPT-5-mini to generate a Scene Analysis and a linear BT XML plan, retrying the Architect step until its output passes the Conformance Validator. This yields the base dataset (1,622 episodes). We then apply structural augmentation to 50% of the base episodes (+811), obtaining 2,433 episodes, and apply lexical augmentation to the resulting set, replacing action names with the same 50% probability.
  • Figure 4: Example of a $3{\times}3$ frame sheet provided to the teacher model (frames are ordered temporally).
  • Figure 5: Training sample example. The user turn provides the image and the task instruction; the assistant turn contains the state analysis YAML followed by the XML BT.
  • ...and 3 more figures
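
The generate-then-validate loop and the dataset sizes described in the Figure 3 caption can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the action vocabulary (`NavigateTo`, `Grasp`, `PlaceOn`), the checks performed by `conforms`, the `max_attempts` cap, and the stub `architect` callable are hypothetical and do not reproduce the authors' Conformance Validator or teacher prompts. Only the outer `<root>` / `<BehaviorTree>` layout follows the BehaviorTree.CPP XML convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical node vocabulary; the paper's actual primitive set is not listed here.
ALLOWED_ACTIONS = {"NavigateTo", "Grasp", "PlaceOn"}
CONTROL_NODES = {"Sequence", "Fallback"}


def conforms(bt_xml: str) -> bool:
    """Return True if bt_xml is well-formed and uses only known node names."""
    try:
        root = ET.fromstring(bt_xml)
    except ET.ParseError:
        return False  # malformed XML never conforms
    if root.tag != "root":
        return False
    trees = root.findall("BehaviorTree")
    if not trees:
        return False
    known = ALLOWED_ACTIONS | CONTROL_NODES
    for tree in trees:
        for node in tree.iter():
            if node is tree:
                continue  # skip the <BehaviorTree> wrapper itself
            if node.tag not in known:
                return False
    return True


def generate_with_retry(architect, max_attempts=3):
    """Call architect() until its output conforms; return None if it never does."""
    for _ in range(max_attempts):
        candidate = architect()
        if conforms(candidate):
            return candidate
    return None


def augmented_size(base: int, fraction: float) -> int:
    """Dataset size after adding one augmented copy for `fraction` of the base episodes."""
    return base + int(base * fraction)


GOOD = (
    '<root BTCPP_format="4"><BehaviorTree ID="MainTree">'
    "<Sequence><NavigateTo/><Grasp/></Sequence>"
    "</BehaviorTree></root>"
)
# Unknown action and missing closing </root> tag: rejected by the validator.
BAD = '<root><BehaviorTree ID="MainTree"><Sequence><Teleport/></Sequence></BehaviorTree>'

# Stub teacher that fails once and then succeeds.
outputs = iter([BAD, GOOD])
plan = generate_with_retry(lambda: next(outputs))

print(plan == GOOD)
print(augmented_size(1622, 0.5))
```

With the caption's numbers, `augmented_size(1622, 0.5)` gives 1,622 + 811 = 2,433 episodes. The attempt cap is a design choice of this sketch so the loop always terminates; the caption itself only states that the Architect step is retried until validation passes.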