Stackelberg POMDP: A Reinforcement Learning Approach for Economic Design
Gianluca Brero, Alon Eden, Darshan Chakrabarti, Matthias Gerstgrasser, Amy Greenwald, Vincent Li, David C. Parkes
TL;DR
This work formalizes indirect mechanism design as a Stackelberg Partially Observable Markov Game (Stackelberg POMG) and introduces the Stackelberg POMDP, a single-agent RL formulation that captures how followers adapt to a leader’s rules. It proves that, within a restricted leader policy class, the optimal Stackelberg leader policy coincides with the optimal policy of the Stackelberg POMDP, and it solves the resulting POMDP using centralized training with decentralized execution (MAPPO) paired with policy-interactive follower models. The framework accommodates diverse follower behaviors, including no-regret learning that converges to Bayesian Coarse Correlated Equilibria (BCCE). It is validated on increasingly complex settings, such as Simple Allocation Mechanisms and Sequential Price Mechanisms with Messages (MSPMs), where it achieves welfare gains with limited communication. The results suggest that this approach can automate and improve the design of indirect mechanisms in multi-round markets with uncertainty, while highlighting practical considerations around data representativeness and deployment objectives.
Abstract
We introduce a reinforcement learning framework for economic design where the interaction between the environment designer and the participants is modeled as a Stackelberg game. In this game, the designer (leader) sets up the rules of the economic system, while the participants (followers) respond strategically. We integrate algorithms for determining followers' response strategies into the leader's learning environment, providing a formulation of the leader's learning problem as a POMDP that we call the Stackelberg POMDP. We prove that the optimal leader's strategy in the Stackelberg game is the optimal policy in our Stackelberg POMDP under a limited set of possible policies, establishing a connection between solving POMDPs and Stackelberg games. We solve our POMDP under a limited set of policy options via the centralized training with decentralized execution framework. For the specific case of followers that are modeled as no-regret learners, we solve an array of increasingly complex settings, including problems of indirect mechanism design where there is turn-taking and limited communication by agents. We demonstrate the effectiveness of our training framework through ablation studies. We also give convergence results for no-regret learners to a Bayesian version of a coarse-correlated equilibrium, extending known results to correlated types.
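The central idea, placing the followers' response algorithm inside the leader's learning environment, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 2x2 payoff matrices, the function names, the choice of Hedge (multiplicative weights) as the no-regret learner, and the brute-force search over pure leader actions, which stands in for the RL training the paper uses. Each "episode" lets the follower adapt to the leader's committed action and only then scores the leader.

```python
import math

# Hypothetical 2x2 game, indexed [leader_action][follower_action].
# These payoffs are illustrative only; they do not come from the paper.
LEADER_PAYOFF = [[1.0, 3.0], [0.0, 2.0]]
FOLLOWER_PAYOFF = [[1.0, 0.0], [0.0, 1.0]]


def follower_response(leader_action, rounds=200, eta=0.5):
    """Run Hedge against a fixed leader action and return the final mixed strategy.

    Hedge is an assumed stand-in for the follower's no-regret learner.
    Weights are kept in log space so the softmax below stays numerically stable.
    """
    payoffs = FOLLOWER_PAYOFF[leader_action]
    log_weights = [0.0] * len(payoffs)
    for _ in range(rounds):
        for b, payoff in enumerate(payoffs):
            log_weights[b] += eta * payoff
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def leader_episode_reward(leader_action):
    """One toy episode: the follower adapts first, then the leader is scored."""
    strategy = follower_response(leader_action)
    return sum(p * LEADER_PAYOFF[leader_action][b] for b, p in enumerate(strategy))


def best_leader_action():
    """Brute-force search over pure leader actions (a stand-in for RL training)."""
    return max(range(len(LEADER_PAYOFF)), key=leader_episode_reward)


if __name__ == "__main__":
    for a in range(len(LEADER_PAYOFF)):
        print(a, leader_episode_reward(a))
    print("best leader action:", best_leader_action())
```

In this toy game, leader action 0 strictly dominates action 1 when the follower's action is held fixed, yet committing to action 1 yields the higher leader reward once the follower's adaptation is accounted for. That gap between simultaneous-move and commitment reasoning is what the leader's environment has to capture.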
