Markov Persuasion Processes: Learning to Persuade from Scratch

Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

TL;DR

This work tackles learning in Markov Persuasion Processes when the sender has no knowledge about the environment. It introduces Optimistic Persuasive Policy Search (OPPS) for both full and partial feedback, leveraging occupancy-measure formulations and optimistic linear programs to learn signaling policies under persuasiveness constraints. The full-feedback setting achieves sublinear regret and violation at rate $\tilde{O}(\sqrt{T})$, while the partial-feedback setting introduces a two-phase exploration-exploitation scheme that yields a tunable trade-off: $R_T=\tilde{O}(T^\alpha)$ and $V_T=\tilde{O}(T^{1-\alpha/2})$ for $\alpha\in[1/2,1]$, with a matching lower bound. These results establish that learning-based sequential persuasion is feasible in unknown Markov environments and quantify the inevitable trade-offs between persuasiveness and performance. The occupancy-measure approach and optimistic LP framework provide a practical and theoretically grounded pathway for deploying sequential persuasion under uncertainty.
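To make the partial-feedback trade-off concrete, the stated rates can be evaluated at a few values of $\alpha$. The values below follow directly from $R_T=\tilde{O}(T^\alpha)$ and $V_T=\tilde{O}(T^{1-\alpha/2})$, with logarithmic factors hidden in $\tilde{O}$. The balancing point is simple arithmetic on the exponents, not a claim taken from the paper.

```latex
\begin{align*}
\alpha = \tfrac{1}{2}: &\quad R_T = \tilde{O}(T^{1/2}), \quad V_T = \tilde{O}(T^{3/4}) \\
\alpha = \tfrac{2}{3}: &\quad R_T = \tilde{O}(T^{2/3}), \quad V_T = \tilde{O}(T^{2/3}) \\
\alpha = 1: &\quad R_T = \tilde{O}(T), \quad V_T = \tilde{O}(T^{1/2})
\end{align*}
```

The two exponents coincide when $\alpha = 1 - \alpha/2$, i.e. at $\alpha = 2/3$. Smaller $\alpha$ favors regret at the expense of violation, and larger $\alpha$ does the opposite.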

Abstract

In Bayesian persuasion, an informed sender strategically discloses information to a receiver so as to persuade them to undertake desirable actions. Recently, growing attention has been devoted to settings in which sender and receivers interact sequentially. In particular, Markov persuasion processes (MPPs) have been introduced to capture sequential scenarios where a sender faces a stream of myopic receivers in a Markovian environment. The MPPs studied so far in the literature suffer from issues that prevent them from being fully operational in practice, e.g., they assume that the sender knows receivers' rewards. We fix such issues by addressing MPPs where the sender has no knowledge about the environment. We design a learning algorithm for the sender, working with partial feedback. We prove that its regret with respect to an optimal information-disclosure policy grows sublinearly in the number of episodes, as does the loss in persuasiveness accumulated while learning. Moreover, we provide a lower bound for our setting matching the guarantees of our algorithm.

Paper Structure

This paper contains 38 sections, 23 theorems, 87 equations, and 3 algorithms.

Key Result

Lemma 1

A vector $q \in [0, 1]^{|X\times \Omega \times A\times X|}$ is a valid occupancy measure of an MPP if and only if the following conditions hold: [conditions not reproduced here], where $P$ is the transition function of the MPP and $\mu$ is its prior function, while $P^q$ and $\mu^q$ are the transition and prior functions, respectively, induced by $q$ (see definitions below).
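As a rough illustration of what "induced by $q$" can mean, the sketch below builds an occupancy measure for a tiny loop-free MPP and recovers the prior, signaling policy, and transitions by normalizing marginals of $q$. The layered state space, the toy sizes, the normalization formulas, and all names (`occupancy`, `induced`, `safe_div`) are assumptions made for illustration. They follow the usual occupancy-measure construction for episodic MDPs and are not taken from the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: loop-free layers X_0, X_1, X_2 (state 3 is terminal).
layers = [[0], [1, 2], [3]]
n_x, n_w, n_a = 4, 2, 2  # numbers of states, outcomes, actions

mu = np.zeros((n_x, n_w))            # prior mu(w | x)
phi = np.zeros((n_x, n_w, n_a))      # signaling policy phi(a | x, w)
P = np.zeros((n_x, n_w, n_a, n_x))   # transitions P(x' | x, w, a)
for k, layer in enumerate(layers[:-1]):
    nxt = layers[k + 1]
    for x in layer:
        mu[x] = rng.dirichlet(np.ones(n_w))
        for w in range(n_w):
            phi[x, w] = rng.dirichlet(np.ones(n_a))
            for a in range(n_a):
                # Mass only on states of the next layer (loop-free assumption).
                P[x, w, a, nxt] = rng.dirichlet(np.ones(len(nxt)))


def occupancy(mu, phi, P):
    """Forward pass: q(x, w, a, x') = Pr(visit x) * mu * phi * P."""
    q = np.zeros_like(P)
    visit = np.zeros(n_x)
    visit[layers[0][0]] = 1.0  # single initial state
    for layer in layers[:-1]:
        for x in layer:
            q[x] = visit[x] * mu[x][:, None, None] * phi[x][:, :, None] * P[x]
            visit += q[x].sum(axis=(0, 1))  # adds mass to next-layer states only
    return q


def safe_div(num, den):
    """Elementwise num / den, returning 0 where den is 0 (unreachable entries)."""
    den = np.broadcast_to(den, num.shape)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def induced(q):
    """Normalize marginals of q to recover (mu^q, phi^q, P^q)."""
    q_xwa = q.sum(axis=3)
    q_xw = q_xwa.sum(axis=2)
    q_x = q_xw.sum(axis=1)
    mu_q = safe_div(q_xw, q_x[:, None])
    phi_q = safe_div(q_xwa, q_xw[:, :, None])
    P_q = safe_div(q, q_xwa[:, :, :, None])
    return mu_q, phi_q, P_q


q = occupancy(mu, phi, P)
mu_q, phi_q, P_q = induced(q)

# Total mass per non-terminal layer, and inflow/outflow mismatch at layer X_1.
layer_mass = [q[layer].sum() for layer in layers[:-1]]
flow_gap = max(abs(q[x].sum() - q[:, :, :, x].sum()) for x in layers[1])
print(layer_mass, flow_gap)
print(np.allclose(mu_q, mu), np.allclose(phi_q, phi), np.allclose(P_q, P))
```

For a $q$ generated this way, each non-terminal layer carries total mass 1, inflow equals outflow at intermediate states, and the induced functions coincide with the ones used to build it. A sender who does not know $P$ and $\mu$ would have to work with estimates instead, which is the situation the optimistic linear programs mentioned in the TL;DR are meant to handle.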

Theorems & Definitions (32)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Lemma 4
  • Lemma 5
  • ...and 22 more