Adaptive Teaching in Heterogeneous Agents: Balancing Surprise in Sparse Reward Scenarios

Emma Clark, Kanghyun Ryu, Negar Mehr

TL;DR

This work tackles teaching when Teacher and Student are heterogeneous and operate under sparse rewards. It introduces a surprise-based Teacher-Student framework where the Teacher maximizes its own surprise to explore while minimizing the Student's surprise through tailored demonstrations, implemented via an intrinsic reward $r_i(s,a)$ that combines $D_{KL}(P(\cdot|s,a)\,\|\,P_{\phi_T}(\cdot|s,a))$ and $D_{KL}(P_{\phi_T}(\cdot|s,a)\,\|\,P_{\phi_S}(\cdot|s,a))$ with weights $\eta_T$ and $\eta_S$, respectively. The Teacher is trained with TRPO and its transitions are modeled as Gaussian $P_{\phi_T}$, while the Student uses Behavioral Cloning from the Teacher's demonstrations. Across Mountain Car, Cart Pole Swing Up, and sparse Half Cheetah, the method yields higher Student rewards in heterogeneous settings than a surprise-maximization baseline and maintains performance in homogeneous settings. This approach enables robust cross-agent teaching in systems with differing dynamics or constraints, potentially enhancing data efficiency and transfer in real-world multi-agent tasks.
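As a rough illustration of how the two surprise terms might be combined, the sketch below computes an intrinsic reward from diagonal-Gaussian transition models. Several things here are assumptions and not taken from the paper: the sign convention (Teacher surprise added, Student surprise subtracted, following the "maximize own, minimize Student's" description), the use of the negative log-likelihood of the observed next state as a stand-in for $D_{KL}(P\,\|\,P_{\phi_T})$ (the true dynamics $P$ are not available in closed form), the diagonal covariance, and all function and parameter names.

```python
import numpy as np


def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)))."""
    mu_p, var_p, mu_q, var_q = (
        np.asarray(v, dtype=float) for v in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def gaussian_nll_diag(x, mu, var):
    """Negative log-likelihood of x under N(mu, diag(var))."""
    x, mu, var = (np.asarray(v, dtype=float) for v in (x, mu, var))
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)


def intrinsic_reward(next_state, teacher_pred, student_pred, eta_T, eta_S):
    """Hypothetical r_i: reward Teacher surprise, penalize Student surprise.

    teacher_pred / student_pred are (mean, variance) pairs predicted by the
    Teacher's and Student's transition models for the same (s, a).
    """
    mu_T, var_T = teacher_pred
    mu_S, var_S = student_pred
    # Assumption: NLL of the observed transition approximates KL(P || P_phi_T).
    teacher_surprise = gaussian_nll_diag(next_state, mu_T, var_T)
    # KL(P_phi_T || P_phi_S): how far the Student's model is from the Teacher's.
    student_surprise = gaussian_kl_diag(mu_T, var_T, mu_S, var_S)
    return eta_T * teacher_surprise - eta_S * student_surprise
```

Under these assumptions, a transition that the Teacher's model predicts poorly raises the reward, while a transition on which the Student's model disagrees with the Teacher's lowers it; setting `eta_S = 0` recovers a plain surprise-maximization bonus.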

Abstract

Learning from Demonstration (LfD) can be an efficient way to train systems with analogous agents by enabling "Student" agents to learn from the demonstrations of the most experienced "Teacher" agent, instead of training their policies in parallel. However, when there are discrepancies in agent capabilities, such as divergent actuator power or joint angle constraints, naively replicating demonstrations that are out of bounds for the Student's capability can limit efficient learning. We present a Teacher-Student learning framework specifically tailored to address the challenge of heterogeneity between the Teacher and Student agents. Our framework is based on the concept of "surprise", inspired by its use as an exploration incentive in sparse-reward environments. Surprise is repurposed to enable the Teacher to detect and adapt to differences between itself and the Student. By maximizing its own surprise in response to the environment while concurrently minimizing the Student's surprise in response to the demonstrations, the Teacher agent can tailor its demonstrations to the Student's specific capabilities and constraints. We validate our method by demonstrating improvements in the Student's learning in control tasks within sparse-reward environments.

Paper Structure (14 sections, 8 equations, 5 figures)

Figures (5)

  • Figure 1: Overview of our Teacher-Student framework
  • Figure 2: Mean and standard deviation of reward for the three environments. Results are from 5 random seeds for Mountain Car and Half Cheetah and 8 random seeds for Cart Pole Swing Up. Our algorithm is deployed in a setting where the Teacher and Student are in the same environment. Baselines are trained in a single-agent setting, without a Teacher. Both the Teacher and the Student in our Teacher-Student framework can learn successful policies in sparse-reward environments.
  • Figure 3: Teacher and Student training results where each agent has different constraints or dynamics. While the average reward of the Teacher is similar for both methods, the Student learning from our Teacher achieves higher average rewards. This shows that our method can provide better demonstrations for a Student with different constraints or dynamics.
  • Figure 4: Teacher demonstrations for the Mountain Car environment where the Student has less power available than the Teacher. At training epoch 0, both methods appear similarly random. By the 100th epoch, our method begins to exhibit larger forces, shown on the f-axis of the figures. At the end of training, there is a clear distinction between the forces exhibited by the two methods. Our method adapts to the low-power dynamics of the Student environment by demonstrating much larger forces than the surprise maximization algorithm.
  • Figure 5: Training results in a sparse Half Cheetah environment with varying weights on Student surprise. The performance gap of the Student widens with an increased weight on Student surprise. This suggests that placing greater emphasis on Student surprise leads the Teacher to provide demonstrations that are more easily followed by the Student.