Learning to Steer Markovian Agents under Model Uncertainty

Jiawei Huang, Vinzenz Thoma, Zebang Shen, Heinrich H. Nax, Niao He

TL;DR

The paper studies how to steer Markovian agents under model uncertainty in a finite-horizon, non-episodic RL setting, using history-dependent steering strategies to cope with unknown learning dynamics. It proposes an objective that optimizes the steering goal minus the steering cost, averaged over a finite model class $\mathcal{F}$ containing the true dynamics $f^*$, and defines the $\varepsilon$-steering gap and Pareto optimality as performance criteria. The authors establish existence results for both known and unknown $f^*$, and develop algorithms for small and large model classes $\mathcal{F}$, including belief-state DP and the First-Explore-Then-Exploit (FETE) framework with identifiability guarantees. Experiments on Stag Hunt and related games show effective steering under model uncertainty, including in scenarios without access to the true dynamics.

Abstract

Designing incentives for an adapting population is a ubiquitous problem in a wide array of economic applications and beyond. In this work, we study how to design additional rewards to steer multi-agent systems towards desired policies without prior knowledge of the agents' underlying learning dynamics. Motivated by the limitations of existing work, we consider a new and general category of learning dynamics called Markovian agents. We introduce a model-based non-episodic Reinforcement Learning (RL) formulation for our steering problem. Importantly, we focus on learning a history-dependent steering strategy to handle the inherent model uncertainty about the agents' learning dynamics. We introduce a novel objective function to encode the desiderata of achieving a good steering outcome at reasonable cost. Theoretically, we identify conditions for the existence of steering strategies that guide agents to the desired policies. Complementing our theoretical contributions, we provide empirical algorithms to approximately solve our objective, which effectively tackle the challenge of learning history-dependent strategies. We demonstrate the efficacy of our algorithms through empirical evaluations.
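
To make the steering setup concrete, the following minimal sketch simulates two agents in a Stag Hunt whose policies follow a softmax (multiplicative-weights) update, with and without an additional reward for playing Stag. The payoff values, the step size `eta`, the constant bonus, and the function names are illustrative assumptions, not the paper's setup. In particular, a constant bonus is a far simpler rule than the history-dependent strategies the paper learns.

```python
import math

# Hypothetical Stag Hunt payoffs for one player (an assumption, not the paper's values).
PAYOFF = {("stag", "stag"): 5.0, ("stag", "hare"): 0.0,
          ("hare", "stag"): 4.0, ("hare", "hare"): 2.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def advantage_of_stag(q_other, bonus):
    """Expected payoff of Stag minus Hare when the other agent plays Stag w.p. q_other."""
    stag = (q_other * PAYOFF[("stag", "stag")]
            + (1 - q_other) * PAYOFF[("stag", "hare")] + bonus)
    hare = (q_other * PAYOFF[("hare", "stag")]
            + (1 - q_other) * PAYOFF[("hare", "hare")])
    return stag - hare


def simulate(p0=0.3, bonus=0.0, steps=200, eta=0.5):
    """Run both agents for `steps` updates; return final P(Stag) per agent and total expected cost."""
    logits = [math.log(p0 / (1 - p0))] * 2
    total_cost = 0.0
    for _ in range(steps):
        probs = [sigmoid(l) for l in logits]
        total_cost += bonus * sum(probs)  # expected bonus paid this step
        # Two-action softmax update: the logit moves by eta times the advantage.
        # Logits are clipped to keep exp() numerically safe.
        logits = [max(-30.0, min(30.0, logits[i] + eta * advantage_of_stag(probs[1 - i], bonus)))
                  for i in range(2)]
    return [sigmoid(l) for l in logits], total_cost


no_steer_probs, no_steer_cost = simulate(bonus=0.0)
steer_probs, steer_cost = simulate(bonus=3.0)
print("without steering:", no_steer_probs, no_steer_cost)
print("with steering:   ", steer_probs, steer_cost)
```

Under these assumed payoffs, agents starting at P(Stag) = 0.3 drift to Hare without steering, while a constant bonus of 3 on Stag moves both agents to Stag at a positive cumulative cost. This is the gap-versus-cost trade-off that the objective is meant to balance.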

Paper Structure

This paper contains 73 sections, 14 theorems, 65 equations, 6 figures, 2 tables, 4 algorithms.

Key Result

Proposition 3.2

[Justification for the objective function] By solving the objective: (1) $\psi^*$ is Pareto optimal; (2) given any $\varepsilon, \varepsilon' > 0$, if $\Psi^{\varepsilon/|\mathcal{F}|} \neq \emptyset$ and $\beta \geq \frac{U_{\max}NHT|\mathcal{F}|}{\varepsilon'}$, we have $\psi^*$ ...
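
As a worked instance of the threshold on $\beta$ in condition (2), take the hypothetical values $U_{\max}=1$, $N=2$, $H=1$, $T=100$, $|\mathcal{F}|=2$, and $\varepsilon'=0.1$. These numbers are assumptions chosen only for illustration; the symbols are used exactly as they appear in the statement.

```latex
\beta \;\geq\; \frac{U_{\max}\, N\, H\, T\, |\mathcal{F}|}{\varepsilon'}
      \;=\; \frac{1 \cdot 2 \cdot 1 \cdot 100 \cdot 2}{0.1}
      \;=\; 4000 .
```

The threshold grows linearly in $|\mathcal{F}|$ and in $1/\varepsilon'$, so doubling the model class or halving $\varepsilon'$ doubles the required $\beta$.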

Figures (6)

  • Figure 1: Example: The "Stag Hunt" Game
  • Figure 2: Grid-World Version of the Stag Hunt Game. Left: Illustration of the game. Right: The performance of agents with and without steering. Without steering, the agents converge to hunting hares, which yields sub-optimal utility. Under our learned steering strategy, the agents converge to a better equilibrium and chase the stag.
  • Figure 3: Evaluation of the procedure for large model sets. Left: Accuracy of the MLE estimator ($\lambda^n_{\text{MLE}}$) after $t$ exploration steps. Our method reaches nearly 100% accuracy after 30 steering steps, while random exploration takes more than 300 steps. Middle and Right: Average steering gap and steering cost of Oracle, FETE, and FETE-RE. FETE performs competitively with Oracle and significantly outperforms FETE-RE (an adaptation of SIAR-MPC [canyakmaz2024steering] to our setting) in terms of steering gap.
  • Figure 4: Probabilistic Graphical Model (PGM) of the POMDP formulation of the steering process. Starting with the initial state $x_1:=(\pi_1,\tau_1)$, for all $t \geq 1$, the mediator receives an observation $o_t\sim\mathbb{O}(\cdot|x_t)$ and outputs the steering reward given the history, $u_t \sim \psi(\cdot|o_1,u_1,...,o_t)$. The agents then update their policies following the dynamics $f$ and the modified reward function $r+u_t$.
  • Figure 5: Trade-off between Steering Gap (Left) and Steering Cost (Right), averaged over initializations of $\bm{\pi}_1$ on a uniformly spaced 5x5 grid (see the appendix on initialization).
  • ...and 1 more figure
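
The MLE-based model identification referenced in Figure 3 can be illustrated with a minimal sketch: given a finite set of candidate dynamics, pick the one under which the observed policy trajectory is most likely. Everything specific here is an assumption for illustration: the one-parameter logit-update model, the Gaussian observation noise, the constant steering reward, and the names `rollout` and `mle_model`. It does not reproduce the paper's exploration strategy.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rollout(eta, rewards):
    """P(action 1) over time under a hypothetical logit-update model with step size eta."""
    logit, probs = 0.0, []
    for u in rewards:
        probs.append(sigmoid(logit))
        logit += eta * u  # the steering reward u shifts the agent's logit
    return probs


def log_likelihood(eta, rewards, observations, sigma):
    """Gaussian log-likelihood (up to an additive constant) of noisy policy observations."""
    preds = rollout(eta, rewards)
    return -sum((o - p) ** 2 for o, p in zip(observations, preds)) / (2 * sigma ** 2)


def mle_model(candidates, rewards, observations, sigma=0.01):
    """Return the candidate step size with the highest log-likelihood."""
    return max(candidates, key=lambda eta: log_likelihood(eta, rewards, observations, sigma))


rng = random.Random(0)
candidates = [0.05, 0.1, 0.2]   # finite model class, indexed by step size
true_eta = 0.1                  # the model that generates the data
rewards = [1.0] * 20            # a fixed, non-adaptive steering sequence
observations = [p + rng.gauss(0.0, 0.01) for p in rollout(true_eta, rewards)]
estimate = mle_model(candidates, rewards, observations)
print("estimated step size:", estimate)
```

Because the candidate trajectories differ by far more than the assumed noise level, the estimator recovers the generating model here. How quickly candidates separate depends on the steering rewards applied during exploration, which is the quantity the Left panel of Figure 3 compares.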

Theorems & Definitions (37)

  • Definition 3.1: Markovian Agents
  • Definition 3.2: Finite Horizon Non-Episodic Steering Setting
  • Proposition 3.2
  • Definition 4.1: Natural Policy Gradient
  • Theorem 4.2: Informal
  • Definition 4.3: $(\delta,T_{\mathcal{F}}^\delta)$-Identifiable
  • Example 4.3
  • Theorem 4.4
  • Proposition C.0
  • proof
  • ...and 27 more