Nested Training for Mutual Adaptation in Human-AI Teaming

Upasana Biswas; Durgesh Kalwar; Subbarao Kambhampati; Sarath Sreedharan

TL;DR

This work proposes a nested training regime to approximately learn the solution to a finite-level I-POMDP, where agents at each level are trained against adaptive agents from the level below, to ensure that the ego agent is exposed to adaptive behavior during training while avoiding the emergence of implicit coordination strategies.

Abstract

Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to a robot's policy. Existing approaches aim to improve the diversity of training partners to approximate human behavior, but these partners are static and fail to capture the adaptive behavior of humans. Exposing robots to adaptive behaviors is critical, yet when both agents learn simultaneously in a multi-agent setting, they often converge to opaque implicit coordination strategies that only work with the agents they were co-trained with. Such agents fail to generalize when paired with new partners. To capture the adaptive behavior of humans, we model the human-robot teaming scenario as an Interactive Partially Observable Markov Decision Process (I-POMDP), explicitly modeling human adaptation as part of the state. We propose a nested training regime to approximately learn the solution to a finite-level I-POMDP. In this framework, agents at each level are trained against adaptive agents from the level below. This ensures that the ego agent is exposed to adaptive behavior during training while avoiding the emergence of implicit coordination strategies, since the training partners are not themselves learning. We train our method in a multi-episode, required-cooperation setup in the Overcooked domain and compare it against several baseline agents designed for human-robot teaming. We evaluate our agent when paired with adaptive partners that were not seen during training. Our results demonstrate that our agent not only achieves higher task performance with these adaptive partners but also exhibits significantly greater adaptability during team interactions.
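
The control flow of the nested regime described above (each level is fit against frozen partners from the level below) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a three-recipe matching game, tabular policies that condition only on the partner's previous choice, and exhaustive search in place of reinforcement learning. The latent embedding of interaction history is omitted. All names (`fixed_policy`, `rollout`, `best_response`, `nested_training`, `EPISODES`) are hypothetical.

```python
import itertools

RECIPES = (0, 1, 2)          # indices standing in for the three recipes
OBS = (None,) + RECIPES      # partner's previous choice; None in the first episode
EPISODES = 5                 # episodes per round (illustrative value, an assumption)


def fixed_policy(recipe):
    """Level-0 partner: ignores history and always picks one recipe."""
    return {obs: recipe for obs in OBS}


def rollout(ego, partner, episodes=EPISODES):
    """Count the episodes in which both agents pick the same recipe."""
    ego_prev = partner_prev = None
    score = 0
    for _ in range(episodes):
        ego_act = ego[partner_prev]       # ego conditions on partner's last choice
        partner_act = partner[ego_prev]   # partner conditions on ego's last choice
        score += int(ego_act == partner_act)
        ego_prev, partner_prev = ego_act, partner_act
    return score


def best_response(frozen_pool):
    """Stand-in for RL training: search all 81 tabular policies against a frozen pool."""
    candidates = (
        dict(zip(OBS, acts))
        for acts in itertools.product(RECIPES, repeat=len(OBS))
    )
    # max() keeps the first maximiser, so ties are broken deterministically.
    return max(candidates, key=lambda pol: sum(rollout(pol, p) for p in frozen_pool))


def nested_training(levels):
    """Return the partner pool at every level, starting from fixed level-0 policies."""
    pool = [fixed_policy(r) for r in RECIPES]   # level 0: static partners
    history = [pool]
    for _ in range(levels):
        # Partners from the level below stay frozen while the new agent is fit.
        pool = [best_response(pool)]
        history.append(pool)
    return history
```

Calling `nested_training(2)` returns the pools for levels 0, 1, and 2. In this toy game the level-1 policy learns to repeat whatever its fixed partner chose last. The point of the sketch is only the training order: no two agents are ever updated at the same time. It makes no claim about reproducing the paper's results.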

Paper Structure (15 sections, 1 theorem, 1 equation, 3 figures, 3 tables)

Introduction
Methodology
Results
Conclusion
Appendix
Environment and Task Setup
Observation Space
Action Space
Reward Structure
Peer Pool Generation
I-POMDP
Proof
Qualitative Analysis of Adaptive Behavior
Statistical Testing
Training Details

Key Result

Theorem 1

Consider a finite Markov game $(\mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{I}}, P, \{r_i\}_{i \in \mathcal{I}})$ admitting a set of joint solution policies $\Pi^* = \{\pi^1, \dots, \pi^m\}$, where each $\pi^j$ represents a distinct joint policy. Agents are trained via a nested process in which [...] Then, under the nested training regime, for every $n \ge 1$, the learned policy $\pi_n$ does not collapse [...]

Figures (3)

Figure 1: Overview of the nested training regime. Level-1 human policies are trained against fixed robot policies, producing a set of adaptive behaviors. The level-2 robot then trains against these adaptive human policies, using a latent embedding to summarize interaction history and approximate nested I-POMDP beliefs, enabling reasoning over multiple adaptive partner strategies.
Figure 2: Multi-recipe Overcooked domain. Each agent selects one of three recipes (PotatoBroccoliSalad, LettuceOnionSalad, or TomatoCarrotSalad) per episode. Each recipe requires both agents to contribute one ingredient from their respective sides, place ingredients on a shared plate, and deliver the completed dish to the serving station. Multiple episodes occur within each round, requiring repeated coordination to establish a shared convention on which recipe to prepare.
Figure 3: Recipe preference trajectories across episodes. Each subplot tracks cumulative actions toward each recipe type (PotatoBroccoliSalad, LettuceOnionSalad, TomatoCarrotSalad) over multiple episodes. Baseline agents oscillate and fail to converge on a single recipe, while the proposed method rapidly establishes a shared convention with its adaptive partner.

Theorems & Definitions (1)

Theorem 1: Non-convergence to a Single Convention under Nested Adaptation
