Swarm Behavior Cloning
Jonas Nüßlein, Maximilian Zorn, Philipp Altmann, Claudia Linnhoff-Popien
TL;DR
Swarm Behavior Cloning (Swarm BC) addresses divergent action predictions among the members of a Behavior Cloning (BC) ensemble in offline imitation learning. It adds a regularization term that encourages the members' hidden feature representations to align, which reduces action divergence while preserving diversity and yields more reliable aggregated actions. An empirical evaluation on eight OpenAI Gym environments shows consistent performance gains and lower mean action differences, particularly in higher-dimensional spaces. A theoretical analysis connects the training objective to concentrating probability mass on the global hidden-feature mode, offering a principled justification for the observed improvements.
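To make the regularization idea concrete, here is a minimal NumPy sketch of a loss that combines a standard BC term with a hidden-feature alignment penalty. The exact form of the regularizer is not given in this summary, so the mean pairwise squared distance, the weight `tau`, and the function name `swarm_style_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def swarm_style_loss(pred_actions, hidden, expert_action, tau=0.1):
    """Illustrative loss for one state (assumed form, not the paper's exact objective).

    pred_actions:  array [N, action_dim], one predicted action per ensemble member
    hidden:        array [N, hidden_dim], one hidden feature vector per member
    expert_action: array [action_dim], the demonstrated action
    tau:           hypothetical weight of the alignment penalty
    """
    # Standard BC term: squared error of each member against the expert action.
    bc_term = np.mean(np.sum((pred_actions - expert_action) ** 2, axis=1))

    # Alignment term: mean squared distance between hidden features
    # over all pairs of ensemble members (requires N >= 2).
    n = hidden.shape[0]
    pairwise = [
        np.sum((hidden[i] - hidden[j]) ** 2)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    align_term = np.mean(pairwise)

    return float(bc_term + tau * align_term)
```

With `tau = 0` this reduces to plain ensemble BC; larger values pull the members' hidden features closer together, trading some diversity for agreement.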
Abstract
In sequential decision-making environments, the primary approaches for training agents are Reinforcement Learning (RL) and Imitation Learning (IL). Unlike RL, which relies on modeling a reward function, IL leverages expert demonstrations, where an expert policy $\pi_e$ (e.g., a human) provides the desired behavior. Formally, a dataset $D$ of state-action pairs is provided: $D = \{(s, a = \pi_e(s))\}$. A common technique within IL is Behavior Cloning (BC), where a policy $\pi(s) = a$ is learned through supervised learning on $D$. Further improvements can be achieved by using an ensemble of $N$ individually trained BC policies, denoted as $E = \{\pi_i(s)\}_{1 \leq i \leq N}$. The ensemble's action $a$ for a given state $s$ is the aggregated output of the $N$ actions: $a = \frac{1}{N} \sum_{i=1}^{N} \pi_i(s)$. This paper addresses the issue of increasing action differences -- the observation that discrepancies between the $N$ predicted actions grow in states that are underrepresented in the training data. Large action differences can result in suboptimal aggregated actions. To address this, we propose a method that fosters greater alignment among the policies while preserving the diversity of their computations. This approach reduces action differences and ensures that the ensemble retains its inherent strengths, such as robustness and varied decision-making. We evaluate our approach across eight diverse environments, demonstrating a notable decrease in action differences and significant improvements in overall performance, as measured by mean episode returns.
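The aggregation rule and the notion of action difference can be illustrated with a short NumPy sketch. The mean pairwise Euclidean distance used here is one plausible way to measure action differences and is an assumption, as are the function names and the example action values.

```python
import numpy as np

def ensemble_action(actions):
    """Average the N predicted actions (array of shape [N, action_dim])."""
    return np.mean(actions, axis=0)

def mean_action_difference(actions):
    """Mean pairwise Euclidean distance between the N predicted actions.

    Assumed metric for illustration; requires N >= 2.
    """
    n = actions.shape[0]
    dists = [
        np.linalg.norm(actions[i] - actions[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(dists))

# Hypothetical predictions of N = 3 policies for a single state (2-D actions).
agree = np.array([[0.9, 0.1], [1.0, 0.0], [1.1, -0.1]])      # well-covered state
disagree = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])   # underrepresented state

print(ensemble_action(agree), mean_action_difference(agree))
print(ensemble_action(disagree), mean_action_difference(disagree))
```

In the second case the averaged action lies far from every individual prediction, which is the kind of suboptimal aggregate the abstract describes.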
