Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Alessandro Adami, Tommaso Tubaldo, Marco Todescato, Ruggero Carli, Pietro Falco

Abstract

Vision-language models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in safety-critical robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. Real-world experiments on two robotic manipulators show that structured policies learned entirely from synthetic supervision transfer successfully to physical systems. The results indicate that foundation models can be adapted to produce interpretable and structured robot policies, providing an alternative to opaque end-to-end approaches for multimodal robot decision making.
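Since the structured policy representation is central to the approach, a minimal, self-contained sketch of Behavior Tree execution is given below. This is illustrative only, not the authors' implementation: the node semantics follow standard BT conventions (Sequence, Fallback, tick-based execution), and the leaf names (cube_visible, PickCube) are hypothetical placeholders for the conditions and skills a fine-tuned VLM might emit.

```python
# Minimal sketch of a Behavior Tree (BT) policy of the kind the paper targets.
# NOT the authors' implementation; leaf names are hypothetical placeholders.

from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    RUNNING = "running"


class Node:
    """Base class: every BT node is ticked and returns a Status."""

    def tick(self, blackboard: dict) -> Status:
        raise NotImplementedError


class Sequence(Node):
    """Ticks children left to right; stops at the first child that is not SUCCESS."""

    def __init__(self, children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS


class Fallback(Node):
    """Ticks children left to right; stops at the first child that is not FAILURE."""

    def __init__(self, children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


class Condition(Node):
    """Leaf that checks a boolean predicate on the blackboard."""

    def __init__(self, key):
        self.key = key

    def tick(self, blackboard):
        return Status.SUCCESS if blackboard.get(self.key) else Status.FAILURE


class Action(Node):
    """Leaf that triggers a robot skill; here it just flips a blackboard flag."""

    def __init__(self, name, effect_key):
        self.name, self.effect_key = name, effect_key

    def tick(self, blackboard):
        print(f"executing skill: {self.name}")
        blackboard[self.effect_key] = True
        return Status.SUCCESS


# Hypothetical "pick the cube" policy: already holding it? otherwise pick it up.
policy = Fallback([
    Condition("cube_grasped"),
    Sequence([Condition("cube_visible"), Action("PickCube", "cube_grasped")]),
])

blackboard = {"cube_visible": True, "cube_grasped": False}
print(policy.tick(blackboard))  # first tick executes the pick branch
print(policy.tick(blackboard))  # second tick: the condition now holds
```

Re-ticking the root at every control cycle is what makes such a policy reactive: if the cube slipped from the gripper, the cube_grasped condition would fail on the next tick and the pick branch would execute again.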

Paper Structure

This paper contains 51 sections, 5 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Overview of the proposed framework. Given synthetic observations, a large foundation model is first used to automatically generate a synthetic supervision dataset composed of task instructions and corresponding Behavior Trees from visual observations and structured system specifications. This dataset is then used to fine-tune the Pixtral-12B vision-language model for constrained symbolic generation. At inference time, the fine-tuned model receives a real-world observation, a task instruction, and system specifications, and outputs a reactive Behavior Tree representing a structured robot policy for task execution. (A minimal sketch of this inference interface follows the figure list.)
  • Figure 2: Examples of synthetic tabletop scenes used in dataset generation.
  • Figure 3: Representation of the target Behavior Tree $\mathcal{T}$ in the prompt schema.
  • Figure 4: Examples of real-world images depicting scenarios consistent with the synthetic dataset.
  • Figure 5: Real-world experimental platforms used to validate hardware-agnostic BT execution: the Franka Emika Panda and the UR5e.
  • ...and 1 more figure
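Figure 1 describes inference as mapping a real-world observation, a task instruction, and a structured system specification to a serialized Behavior Tree. The sketch below illustrates that interface shape only; PolicyRequest, build_prompt, query_vlm, and the XML output format are hypothetical stand-ins, since the paper's actual prompt schema and the fine-tuned Pixtral-12B serving API are not reproduced here.

```python
# Hedged sketch of the inference interface summarized in Figure 1: a fine-tuned
# VLM maps (image, instruction, system specification) to a Behavior Tree
# serialized as text. query_vlm is a stub for the actual model call.

from dataclasses import dataclass


@dataclass
class PolicyRequest:
    image_path: str   # real-world observation
    instruction: str  # natural language task instruction
    system_spec: str  # structured description of available skills


def build_prompt(req: PolicyRequest) -> str:
    """Assemble the text portion of the prompt; the image is attached separately."""
    return (
        f"System specification:\n{req.system_spec}\n\n"
        f"Instruction: {req.instruction}\n"
        "Output a Behavior Tree in XML using only the listed skills."
    )


def query_vlm(prompt: str, image_path: str) -> str:
    """Stub for the fine-tuned model; returns a canned, well-formed BT."""
    return (
        "<Fallback>"
        "<Condition name='cube_grasped'/>"
        "<Sequence><Condition name='cube_visible'/>"
        "<Action name='PickCube'/></Sequence>"
        "</Fallback>"
    )


req = PolicyRequest(
    "scene.png",
    "Pick up the red cube.",
    "skills: PickCube, PlaceCube; conditions: cube_visible, cube_grasped",
)
bt_xml = query_vlm(build_prompt(req), req.image_path)
print(bt_xml)  # the XML would then be parsed into an executable tree
```

Constraining the model to emit only the skills and conditions named in the system specification is what keeps the generated policy executable on a given platform, which is consistent with the hardware-agnostic BT execution validated in Figure 5.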