Progress Constraints for Reinforcement Learning in Behavior Trees

Finn Rietz, Mart Kartašev, Johannes A. Stork, Petter Ögren

TL;DR

This work tackles unsafe, inefficient integration of reinforcement learning with Behavior Trees by introducing progress constraints grounded in BT convergence theory. The authors define feasibility-estimator-based action masks to preserve the BT's progression, train subtask estimators and controllers iteratively in a sequence that follows the intended task progression, and demonstrate improved safety, sample efficiency, and constraint satisfaction in both a 2D navigation task and a high-fidelity warehouse scenario. Key contributions include extending BT convergence results to general BTs, learning feasibility estimators for convergence regions, and presenting a CBTRL algorithm that yields modular, constraint-aware RL controllers within BTs. The results show that enforcing progress constraints during training avoids unsafe handoffs and reduces exploration of suboptimal actions, offering a practical route to reliable BT–RL in robotics and interactive systems.

Abstract

Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
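
The masking idea described above can be illustrated with a minimal sketch. It assumes a discrete action set, a toy 1D system, and a hand-written stand-in for a learned feasibility estimator; the names `step`, `feasibility`, `allowed_actions`, `masked_random_policy`, the threshold, and the unsafe boundary are all hypothetical and are not taken from the paper's implementation.

```python
import random
from typing import List, Sequence

# All names, dynamics, and thresholds below are illustrative assumptions.
ACTIONS: Sequence[float] = (-1.0, 0.0, 1.0)
UNSAFE_BOUNDARY = 2.0  # assumed: positions beyond this violate the constraint


def step(state: float, action: float) -> float:
    """Hypothetical toy dynamics: a 1D position shifted by the action."""
    return state + action


def feasibility(state: float) -> float:
    """Stand-in for a learned feasibility estimator.

    Returns 1.0 if the progress constraint can still be satisfied from
    `state`, and 0.0 otherwise (here: past the assumed unsafe boundary).
    """
    return 0.0 if state > UNSAFE_BOUNDARY else 1.0


def allowed_actions(state: float,
                    actions: Sequence[float] = ACTIONS,
                    threshold: float = 0.5) -> List[float]:
    """Keep only actions whose successor state is estimated to be feasible."""
    return [a for a in actions if feasibility(step(state, a)) >= threshold]


def masked_random_policy(state: float,
                         rng: random.Random,
                         actions: Sequence[float] = ACTIONS) -> float:
    """Explore uniformly, but only inside the masked action set."""
    allowed = allowed_actions(state, actions)
    if not allowed:
        # No action passes the mask: fall back to the least infeasible one.
        return max(actions, key=lambda a: feasibility(step(state, a)))
    return rng.choice(allowed)
```

With these toy definitions, `allowed_actions(1.5)` drops the action `1.0` because it would move the state past the assumed boundary, so exploration never samples it. A learned estimator would replace the hand-written `feasibility` function; the masking step itself would stay the same.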

Paper Structure

This paper contains 20 sections, 1 theorem, 13 equations, 6 figures, and 1 table.

Key Result

Theorem 1

(Convergence of BTs, from [ogrenBehaviorTreesRobot2022]) Given the set of BT leaf nodes $J = \{1,\dots,j\}$, let If there exists a re-labeling of the nodes such that for all $i \in J$, the convergence set $C_i$ is invariant under $\pi_i$ and there exists a time-horizon $h$, such that if $\mathbf{x}_t \in \Omega_i$ then $\mathbf{x}_{t+h}\notin \Omega_i$. Then, the state $\mathbf{x}$ will enter the suc
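
The two conditions named in Theorem 1 (invariance of a convergence set under its controller, and leaving the operating region after a finite horizon) can be checked mechanically on a small finite system. The sketch below is only an illustration under stated assumptions: the six-state line world, the deterministic `transition` function, the controllers `move_right` and `move_left`, and the sets `OMEGA` and `CONV` are all hypothetical, and the sketch does not reproduce the paper's definitions of $\Omega_i$ or $C_i$.

```python
from typing import Callable, Optional, Set

Policy = Callable[[int], int]
Transition = Callable[[int, int], int]


def transition(x: int, u: int) -> int:
    """Hypothetical deterministic dynamics on states 0..5 (clamped line)."""
    return min(max(x + u, 0), 5)


def move_right(x: int) -> int:
    """Toy controller that always steps toward state 5."""
    return 1


def move_left(x: int) -> int:
    """Toy controller that always steps toward state 0."""
    return -1


def is_invariant(conv: Set[int], policy: Policy, f: Transition) -> bool:
    """True if one step under `policy` never leaves `conv`."""
    return all(f(x, policy(x)) in conv for x in conv)


def rollout(x: int, policy: Policy, f: Transition, steps: int) -> int:
    """State reached from `x` after `steps` applications of `policy`."""
    for _ in range(steps):
        x = f(x, policy(x))
    return x


def find_horizon(omega: Set[int], policy: Policy, f: Transition,
                 max_h: int) -> Optional[int]:
    """Smallest h <= max_h with x_t in omega implying x_{t+h} not in omega."""
    for h in range(1, max_h + 1):
        if all(rollout(x, policy, f, h) not in omega for x in omega):
            return h
    return None


OMEGA = {0, 1, 2}          # assumed operating region of the controller
CONV = {0, 1, 2, 3, 4, 5}  # assumed convergence set
```

In this toy setting, `move_right` keeps the state inside `CONV` and leaves `OMEGA` after a finite number of steps, so both conditions hold. `move_left` never leaves `OMEGA`, so `find_horizon` returns `None` for it.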

Figures (6)

  • Figure 1: Top: High-fidelity warehouse environment. The agent, a small mobile manipulator, must collect and deliver items while avoiding collisions with the bigger, dynamic forklift with unknown dynamics. Bottom: Corresponding Behavior Tree with RL controllers $\pi_1$, $\pi_2$, and $\pi_3$. Naïvely learning the controllers with RL can result in unsafe or greedy behavior, while our method accounts for long-term BT progress constraints. Note that the lower part of the BT applies the Implicit Sequence design principle [colledanchiseBehaviorTreesRobotics2017].
  • Figure 2: Visualization of operating regions, a controller, and its convergence set. The controller $\pi_2$ may leave its operating region $\Omega_2$ but must remain in its convergence set $C_2$.
  • Figure 3: 2D goal-reach environment. Top: Task-switching BT with RL-based controllers. The BT selects the goal controller $\pi_2$ when the state is in its convergence set $C_2$, which corresponds to states outside of the unsafe region and the slope. Bottom: Trajectories generated by different methods. Our method converges to a near-optimal behavior for the BT task of reaching the goal without entering the unsafe region. Note that with our method, $\pi_2$ never violates the progress constraint "safe" during learning.
  • Figure 4: Empirical results on the 2D navigation environment. We plot the mean of five repetitions with varying random seeds; the shaded area corresponds to one standard deviation around the mean. To enhance readability, the x-axis is linearly scaled in the range $[0, 10^5]$ and logarithmically scaled from $10^5$ to $10^6$.
  • Figure 5: Left: Learned state feasibility function for the unsafe region in the 2D environment accounts for both agent and environment dynamics. Right: Induced action space mask $\mathcal{U}_1$ with the agent placed at and velocities as in the top image on the left. Even though the safety constraint violation only occurs in the unsafe area, the constraint prevents the agent from stepping onto the slope.
  • ...and 1 more figure
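
The mixed x-axis described for Figure 4 (linear up to $10^5$, logarithmic from $10^5$ to $10^6$) can be written as a piecewise mapping. The following is a minimal sketch of one such mapping, not the authors' plotting code; the function name `mixed_axis_position` and the assumption that the linear part occupies half of the axis width (`linear_fraction`) are hypothetical, since the caption does not state the proportions.

```python
import math


def mixed_axis_position(x: float,
                        threshold: float = 1e5,
                        x_max: float = 1e6,
                        linear_fraction: float = 0.5) -> float:
    """Map x in [0, x_max] to an axis position in [0, 1].

    Values up to `threshold` are placed linearly on the first
    `linear_fraction` of the axis; larger values are placed on the
    remainder according to log10(x). The mapping is continuous at
    `threshold`.
    """
    if not 0.0 <= x <= x_max:
        raise ValueError("x must lie in [0, x_max]")
    if x <= threshold:
        return linear_fraction * x / threshold
    log_span = math.log10(x_max) - math.log10(threshold)
    log_part = (math.log10(x) - math.log10(threshold)) / log_span
    return linear_fraction + (1.0 - linear_fraction) * log_part
```

Under these assumptions, $5 \times 10^4$ environment steps land a quarter of the way along the axis and $10^5$ lands at the midpoint. In matplotlib, a comparable effect is available through the `symlog` axis scale with its `linthresh` parameter.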

Theorems & Definitions (7)

  • Definition 1: Behavior Tree
  • Definition 2
  • Definition 3
  • Theorem 1
  • Remark 1: Intended order of progression
  • Remark 2
  • Remark 3: Customization