On the Fly Adaptation of Behavior Tree-Based Policies through Reinforcement Learning

Marco Iannotta, Johannes A. Stork, Erik Schaffernicht, Todor Stoyanov

TL;DR

The paper tackles adapting Behavior Tree–based policies to local task variations in dynamic manufacturing. It introduces a hierarchical, context-conditioned reinforcement learning framework where an upper-level policy $\pi^{up}_{\boldsymbol{\omega}}$ selects BT parameters $\hat{\boldsymbol{\theta}}$ for a lower-level BT policy $\pi^{low}_{\boldsymbol{\theta}}$, guided by episodic context $\boldsymbol{c}$. Through online RL (SAC) with a replay buffer, the approach achieves fast convergence and generalization across increasingly many task variations, demonstrated both in simulation (Obstacle Avoidance) and on a real Franka Panda (Pivoting). The results show that sharing experience across task variants enables scalable training and improved performance compared with baselines, while preserving the interpretability and safety benefits of BTs. Limitations include fixed (non-learnable) Condition Nodes, with future work aiming to incorporate learnable conditions for even greater adaptability.
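
The hierarchy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the upper-level policy here samples parameters uniformly instead of using a SAC actor, the BT is reduced to a plain sequence of Action Nodes, and the names `UpperPolicy`, `run_episode`, and `toy_execute`, as well as the reward function, are hypothetical. What it does show is the data flow: each Action Node queries the upper-level policy for its parameters given the episodic context, and transitions from all contexts land in one shared replay buffer.

```python
import random
from collections import deque


class UpperPolicy:
    """Stand-in for the upper-level policy (assumption: uniform sampling,
    where the paper trains a SAC policy conditioned on the context)."""

    def __init__(self, param_dim, low=-1.0, high=1.0, seed=0):
        self.param_dim, self.low, self.high = param_dim, low, high
        self.rng = random.Random(seed)

    def select(self, context, node_index):
        # A learned policy would use context and node_index; this one ignores them.
        return [self.rng.uniform(self.low, self.high) for _ in range(self.param_dim)]


def run_episode(policy, context, num_nodes, execute_node, buffer):
    """Run a purely sequential BT: every Action Node requests its parameters."""
    total = 0.0
    for i in range(num_nodes):
        theta = policy.select(context, i)          # query upper-level policy
        reward = execute_node(i, theta, context)   # run Action Node i with theta
        buffer.append((context, i, theta, reward)) # store experience for learning
        total += reward
    return total


def toy_execute(node_index, theta, context):
    # Hypothetical reward: negative squared distance between parameters and context.
    return -sum((t - c) ** 2 for t, c in zip(theta, context))


buffer = deque(maxlen=10_000)            # one buffer shared by all task variations
policy = UpperPolicy(param_dim=2)
contexts = [(0.1, 0.2), (0.5, -0.3)]     # two illustrative task variations
returns = [run_episode(policy, c, 3, toy_execute, buffer) for c in contexts]
```

Because the buffer is shared, a learner sampling from it sees experience from every task variation, which is the mechanism the summary credits for scalable training.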

Abstract

With the rising demand for flexible manufacturing, robots are increasingly expected to operate in dynamic environments where local variations, such as slight offsets or size differences in workpieces, are common. We propose to address the problem of adapting robot behaviors to these task variations with a sample-efficient hierarchical reinforcement learning approach adapting Behavior Tree (BT)-based policies. We maintain the core BT properties as an interpretable, modular framework for structuring reactive behaviors, but extend their use beyond static tasks by inherently accommodating local task variations. To show the efficiency and effectiveness of our approach, we conduct experiments both in simulation and on a Franka Emika Panda 7-DoF, with the manipulator adapting to different obstacle avoidance and pivoting tasks.

Paper Structure

This paper contains 12 sections, 8 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of our method for learning an agent that adapts to task variations. We propose a hierarchical approach, where an upper-level policy $\pi^{\mathit{up}}_{\boldsymbol{\omega}}$ selects a set of parameters that are used by a lower-level BT-based policy $\pi^{\mathit{low}}_{\boldsymbol{\theta}}$ to control the robot. The upper-level policy is conditioned on a context vector $\boldsymbol{c}$ encoding task variations, to adjust the robot's behavior accordingly. The environment provides an episodic context $\boldsymbol{c}_e$ and the state $\boldsymbol{x}_t$. $\pi^{\mathit{low}}_{\boldsymbol{\theta}}$ selects an Action Node $a^i$ to execute ($\textbf{1}$), and queries $\pi^{\mathit{up}}_{\boldsymbol{\omega}}$ for parameters $\boldsymbol{\theta}^i$ ($\textbf{2}$ and $\textbf{3}$). Then, $a^i$ with parameters $\boldsymbol{\theta}^i$ is executed until completion or until the BT halts it ($\textbf{4}$). The environment provides a reward $\boldsymbol{r}_t$, which is used to update the parameters $\boldsymbol{\omega}$ of the upper-level policy and the process is repeated.
  • Figure 2: Obstacle avoidance task. Task panel: The objective for a robotic arm is to move its end-effector between predetermined start and goal positions ($S$ and $G$ respectively) while avoiding a static obstacle. Task variations arise from different obstacle heights $h_o$, widths $w_o$, and positions in a horizontal direction $x_o$. BT panel: We design a BT policy $\pi^{bt}_{\boldsymbol{\theta}}$ with $4$ Action Nodes, each performing a linear motion. $\Delta x_i$ and $\Delta z_i$ are the goal relative coordinates for each motion w.r.t. the current position. The last Action Node is not parameterized, as the goal location $G$ is predetermined. R denotes a reactive control node [bts_book], continuously checking for collisions during motion.
  • Figure 3: Learning curves for the obstacle avoidance task obtained by periodically evaluating policies on all training contexts. The solid line and shaded region represent the mean and standard deviation, respectively ($10$ replicates). One panel compares our BT-based policy with a standard SAC policy on an increasing number of training contexts. The dashed line indicates our policy convergence (i.e., reward improvement over the last $150$ episodes $< 2\%$). The other panel compares our step-based upper-level policy with an episode-based one on an increasing number of intermediate goals (ANs in the BT of Figure 2).
  • Figure 4: Illustration of an object pivoting task being executed on a Franka Emika Panda 7-DoF manipulator. We perform 4 motions in the $x$-$z$ plane, by commanding goal relative coordinates w.r.t. the current end-effector position. Solid arrows denote directions along which the goal relative coordinate is learned. The grey dashed line shows the overall trajectory.
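
The convergence criterion in the Figure 3 caption (reward improvement over the last $150$ episodes below $2\%$) could be checked along the following lines. The caption does not say exactly how the improvement is measured, so this sketch assumes a relative change between the latest reward and the reward recorded `window` episodes earlier; the function name `has_converged` and its parameters are hypothetical.

```python
def has_converged(rewards, window=150, tol=0.02):
    """Return True when the relative reward improvement over the last
    `window` episodes is below `tol` (assumed reading of the criterion)."""
    if len(rewards) <= window:
        return False                      # not enough history yet
    past, latest = rewards[-window - 1], rewards[-1]
    denom = max(abs(past), 1e-8)          # guard against division by zero
    return (latest - past) / denom < tol


flat = [-10.0] * 200                              # no improvement at all
rising = [-100.0 + 0.5 * i for i in range(200)]   # still improving steadily
short = [1.0] * 10                                # fewer episodes than the window
```

Under this reading, a reward that decreases also counts as "improvement below 2%"; comparing window means instead of two single values would be a reasonable alternative when per-episode rewards are noisy.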