Progress Constraints for Reinforcement Learning in Behavior Trees
Finn Rietz, Mart Kartašev, Johannes A. Stork, Petter Ögren
TL;DR
This work addresses the unsafe and inefficient integration of reinforcement learning (RL) with Behavior Trees (BTs) by introducing progress constraints grounded in BT convergence theory. The authors define action masks based on feasibility estimators that preserve the BT's progression, train subtask estimators and controllers iteratively in the order of the intended task progression, and demonstrate improved safety, sample efficiency, and constraint satisfaction in both a 2D navigation task and a high-fidelity warehouse scenario. Key contributions include extending BT convergence results to general BTs, learning feasibility estimators for convergence regions, and presenting the CBTRL algorithm, which yields modular, constraint-aware RL controllers within BTs. The results show that enforcing progress constraints during training avoids unsafe handoffs between controllers and reduces exploration of suboptimal actions, offering a practical route to reliable BT-RL integration in robotics and interactive systems.
Abstract
Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
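To make the idea of constraining the allowed action set concrete, the following is a minimal sketch of feasibility-based action masking over a small discrete action set. It is not the authors' implementation. The names `progress_mask`, `masked_greedy_action`, `toy_feasibility`, and `toy_q`, the threshold of 0.5, the fallback rule for an empty allowed set, and the toy region x >= 0 are all assumptions made for illustration. In the paper the feasibility estimators are learned, whereas here a hand-written function stands in for one.

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, float]  # hypothetical 2D position (x, y)
Action = int                 # index into a discrete action set

# Hypothetical action set: unit moves right, left, up, down.
MOVES: Dict[Action, Tuple[float, float]] = {
    0: (1.0, 0.0),
    1: (-1.0, 0.0),
    2: (0.0, 1.0),
    3: (0.0, -1.0),
}


def progress_mask(
    state: State,
    actions: Sequence[Action],
    feasibility: Callable[[State, Action], float],
    threshold: float = 0.5,
) -> List[bool]:
    """Mark an action as allowed if its estimated feasibility meets the threshold."""
    return [feasibility(state, a) >= threshold for a in actions]


def masked_greedy_action(
    state: State,
    actions: Sequence[Action],
    q_values: Callable[[State, Action], float],
    feasibility: Callable[[State, Action], float],
    threshold: float = 0.5,
) -> Action:
    """Pick the highest-value action among those the mask allows."""
    mask = progress_mask(state, actions, feasibility, threshold)
    allowed = [a for a, ok in zip(actions, mask) if ok]
    if not allowed:
        # Assumed fallback (not from the paper): take the most feasible action.
        return max(actions, key=lambda a: feasibility(state, a))
    return max(allowed, key=lambda a: q_values(state, a))


def toy_feasibility(state: State, action: Action) -> float:
    """Stand-in for a learned estimator: 1.0 if the next state keeps x >= 0."""
    dx, _ = MOVES[action]
    return 1.0 if state[0] + dx >= 0.0 else 0.0


def toy_q(state: State, action: Action) -> float:
    """Stand-in value function that prefers moving left (action 1)."""
    return {0: 0.1, 1: 1.0, 2: 0.5, 3: 0.2}[action]


if __name__ == "__main__":
    s = (0.0, 0.0)
    acts = list(MOVES)
    print(progress_mask(s, acts, toy_feasibility))
    print(masked_greedy_action(s, acts, toy_q, toy_feasibility))
```

In this toy setting the region x >= 0 plays the role of a previously achieved subgoal. An unconstrained greedy controller would move left and leave that region, while the masked controller is restricted to actions that stay inside it and picks the best of those.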
