MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning

Arundhati Banerjee, Soham Phade, Stefano Ermon, Stephan Zheng

TL;DR

MERMAIDE introduces model-based meta-learning to align learners by learning a world model and a meta-learned intervention policy that quickly adapts to unseen agents. By treating each agent as a task and employing MAML with a recurrent world model, the approach achieves fast near-equilibrium alignment in Stackelberg games and cost-efficient intervention policies in bandit settings, even under partial observability and distribution shifts. The framework outperforms model-free baselines and provides insights into when and how to intervene, highlighting the value of model-based priors for non-stationary principal–agent environments. This work offers a flexible, few-shot generalizable method for adaptive incentive design with potential impact on economies, education, and personalized systems where agents learn over time. The results underscore the practical significance of combining world models with gradient-based meta-learning to handle non-stationarity and unseen agent strategies in real-world interventions.

Abstract

We study how a principal can efficiently and effectively intervene on the rewards of a previously unseen learning agent in order to induce desirable outcomes. This is relevant to many real-world settings like auctions or taxation, where the principal may know neither the learning behavior nor the rewards of real people. Moreover, the principal should be few-shot adaptable and minimize the number of interventions, because interventions are often costly. We introduce MERMAIDE, a model-based meta-learning framework to train a principal that can quickly adapt to out-of-distribution agents with different learning strategies and reward functions. We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both $0$-shot and $K=1$-shot settings with partial agent information.
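As a point of reference for the Stackelberg setting mentioned in the abstract, the following is a minimal sketch of a pure-strategy Stackelberg equilibrium with a best-response follower. It is a generic textbook illustration, not the paper's environment: the payoff matrices and the names `LEADER_PAYOFF`, `FOLLOWER_PAYOFF`, `best_response`, and `pure_stackelberg` are all hypothetical.

```python
# Hypothetical 2x3 bimatrix game (illustrative payoffs, not from the paper).
# Rows index the leader's (principal's) actions, columns the follower's (agent's).
LEADER_PAYOFF = [[2.0, 4.0, 1.0],
                 [3.0, 1.0, 5.0]]
FOLLOWER_PAYOFF = [[1.0, 2.0, 0.0],
                   [0.0, 1.0, 3.0]]


def best_response(follower_payoff, leader_action):
    """Follower's payoff-maximizing action once the leader has committed."""
    row = follower_payoff[leader_action]
    # Ties are broken in favor of the lowest action index.
    return max(range(len(row)), key=lambda a: row[a])


def pure_stackelberg(leader_payoff, follower_payoff):
    """Leader commits to the action that is best given the follower's best response."""
    best = None
    for i in range(len(leader_payoff)):
        j = best_response(follower_payoff, i)
        value = leader_payoff[i][j]
        if best is None or value > best[2]:
            best = (i, j, value)
    return best


leader_action, follower_action, leader_value = pure_stackelberg(
    LEADER_PAYOFF, FOLLOWER_PAYOFF
)
```

In this toy game the leader cannot simply pick its largest payoff entry; it has to anticipate which column the follower will choose in response, which is the structure the principal must learn when the agent's rewards are not known in advance.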

Paper Structure (47 sections, 8 equations, 6 figures, 7 tables, 2 algorithms)

Figures (6)

  • Figure 1: Overview of MERMAIDE. Left: Flow of principal and agent observables, rewards, and actions. Right: The principal's world model and intervention policy. Also see the MERMAIDE algorithm listing.
  • Figure 2: Single-round game. REINFORCE (RL) does not adapt to the expected Stackelberg equilibrium during evaluation. MAML's adaptability suffers under observation noise.
  • Figure 3: Multi-round game. (a) Principal's optimization trajectory in the expected payoff landscape during training. Axes are PCA directions in the policy parameter space. The colorbar indicates the principal's expected payoff over the training agents. (b), (c) MAML adapts (single shot) to the Stackelberg equilibrium with a best-response agent. The number of interventions is normalized by the episode length $T=100$.
  • Figure 4: Characterizing the agent's behavior. UCB agent with base rewards $\left[0.16, 0.11, 0.66, 0.14, 0.20, 0.37, \mathbf{0.82}, 0.10, \mathbf{0.84}, 0.10\right]$. The agent prefers the action with base reward 0.84, while the principal prefers the action $a^*$ with base reward 0.82. The horizontal axis indicates time steps $t \in \{1,\dots,200\}$. The vertical axis indicates agents following UCB with different exploration coefficients $\beta$. Values are either 0 or 1. (a) Frequency distribution of the agent selecting its unintervened preferred action with base reward 0.84. (b) Frequency distribution of the agent selecting $a^*$ without the principal's intervention. (c) Frequency distribution of the agent selecting $a^*$ under the principal's intervention S1. (d) Frequency distribution of the agent selecting $a^*$ under the principal's intervention S2. For a small $\delta = \max_{a\in A}\bm{r}^i[a] - \bm{r}^i[a^*] = 0.02$, S1 and S2 affect the agent's behavior similarly.
  • Figure 5: Characterizing the agent's behavior. UCB agent with base rewards $\left[0.32, 0.67, 0.13, 0.72, 0.29, 0.18, \mathbf{0.59}, 0.02, \mathbf{0.83}, 0.01\right]$. The agent prefers the action with base reward 0.83, while the principal prefers the action $a^*$ with base reward 0.59. The horizontal axis indicates time steps $t \in \{1,\dots,200\}$. The vertical axis indicates agents following UCB with different exploration coefficients $\beta$. Values are either 0 or 1. (a) Frequency distribution of the agent selecting its unintervened preferred action with base reward 0.83. (b) Frequency distribution of the agent selecting $a^*$ without the principal's intervention. (c) Frequency distribution of the agent selecting $a^*$ under the principal's intervention S1. (d) Frequency distribution of the agent selecting $a^*$ under the principal's intervention S2. For different values of $\beta$, the UCB agent acts differently depending on whether the principal intervenes following S1 or S2. S1 intervenes periodically, whereas S2 intervenes only at the beginning. The actions the agents select within an episode reflect how this difference in timing affects the principal's ability to align the agent's preference with its own.
  • ...and 1 more figure
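The setting in the Figure 4 caption can be sketched as a small simulation: a UCB agent with exploration coefficient $\beta$ acting over 200 steps while a principal adds a bonus to the reward of its preferred action $a^*$. This is only a rough sketch under several assumptions that are not stated in the captions: rewards are noise-free, the UCB index is the standard mean plus $\beta\sqrt{\ln t / n_a}$, the bonus size `bonus=0.1` is made up, and the schedules `periodic` and `early` are hypothetical stand-ins for S1 ("intervenes periodically") and S2 ("intervenes only at the beginning"), whose exact definitions are not given here.

```python
import math

# Base rewards from the Figure 4 caption.
BASE_REWARDS = [0.16, 0.11, 0.66, 0.14, 0.20, 0.37, 0.82, 0.10, 0.84, 0.10]
TARGET = 6  # index of the principal's preferred action a* (base reward 0.82)


def run_ucb(beta, intervene_at, bonus=0.1, horizon=200):
    """Simulate a UCB agent; return the list of actions it selects.

    intervene_at(t) -> bool says whether the principal boosts the reward of
    TARGET at step t. The agent only notices the boost if it pulls TARGET.
    """
    n = len(BASE_REWARDS)
    counts = [0] * n
    means = [0.0] * n
    picks = []
    for t in range(1, horizon + 1):
        if t <= n:
            action = t - 1  # pull every arm once before using the UCB index
        else:
            scores = [
                means[a] + beta * math.sqrt(math.log(t) / counts[a])
                for a in range(n)
            ]
            action = max(range(n), key=lambda a: scores[a])
        reward = BASE_REWARDS[action]  # assumption: noise-free rewards
        if action == TARGET and intervene_at(t):
            reward += bonus  # assumption: additive bonus of hypothetical size
        counts[action] += 1
        means[action] += (reward - means[action]) / counts[action]
        picks.append(action)
    return picks


# Hypothetical intervention schedules (stand-ins for S1 and S2).
def never(t):
    return False


def periodic(t):
    return t % 5 == 0


def early(t):
    return t <= 20


def target_rate(picks):
    """Fraction of steps on which the agent selected a*."""
    return picks.count(TARGET) / len(picks)
```

Calling, for example, `target_rate(run_ucb(beta, early))` for a range of `beta` values gives a crude analogue of one row per agent in the figure. With `beta = 0` and no intervention, the agent settles on its own preferred action (index 8) right after the initial round of pulls, which is the baseline the principal is trying to shift.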