The Algorithmic Advantage: How Reinforcement Learning Generates Rich Communication

Emilio Calvano, Clemens Possnig, Juha Tolvanen

TL;DR

This paper embeds a tabular Q-learning sender into the Crawford–Sobel cheap-talk framework to analyze how reward-driven adaptation shapes long-run communication under a rational, best-responding receiver. It provides a theoretical lower bound on informativeness in the no-bias case and shows that with misaligned preferences the learning dynamics exhibit cycles that outperform any static equilibrium, while never fully converging to a fixed language. The results reveal a fundamental trade-off: exploration and reward-driven updates promote informative communication and welfare gains, but may prevent full revelation or convergence, especially under misalignment. The findings have practical implications for algorithmic advice on platforms, highlighting both resilience to uninformative outcomes and potential for unstable or cyclical language when incentives are not aligned.

Abstract

We analyze strategic communication when advice is generated by a reinforcement-learning algorithm rather than by a fully rational sender. Building on the cheap-talk framework of Crawford and Sobel (1982), an advisor adapts its messages based on payoff feedback, while a decision maker best-responds. We provide a theoretical analysis of the long-run communication outcomes induced by such reward-driven adaptation. With aligned preferences, we establish that learning robustly leads to informative communication even from uninformative initial policies. With misaligned preferences, no stable outcome exists; instead, learning generates cycles that sustain highly informative communication and payoffs exceeding those of any static equilibrium.
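The setup described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes a uniform prior over an evenly spaced state grid, quadratic-loss payoffs as in the standard Crawford–Sobel example, an epsilon-greedy sender with a constant learning rate, and a receiver who best-responds to the sender's current greedy policy each period. The names `receiver_best_response`, `greedy_policy`, `run_q_learning` and all parameter values are hypothetical.

```python
import random


def receiver_best_response(policy, states):
    """Posterior-mean action per message (assumes uniform prior, quadratic loss)."""
    n_messages = len(policy[0])
    prior_mean = sum(states) / len(states)
    actions = []
    for m in range(n_messages):
        mass = sum(policy[i][m] for i in range(len(states)))
        if mass > 0:
            weighted = sum(policy[i][m] * states[i] for i in range(len(states)))
            actions.append(weighted / mass)
        else:
            # Off-path message: assumed here to be answered with the prior mean.
            actions.append(prior_mean)
    return actions


def greedy_policy(Q):
    """Pure policy: each state sends its highest-Q message (lowest index on ties)."""
    policy = []
    for row in Q:
        best = max(range(len(row)), key=lambda m: row[m])
        policy.append([1.0 if m == best else 0.0 for m in range(len(row))])
    return policy


def run_q_learning(K=5, bias=0.0, steps=20000, alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning sender facing a best-responding receiver (sketch)."""
    rng = random.Random(seed)
    states = [k / (K - 1) for k in range(K)]
    # All-zero Q-table: every state initially sends message 0 (a pooling start).
    Q = [[0.0] * K for _ in range(K)]
    for _ in range(steps):
        actions = receiver_best_response(greedy_policy(Q), states)
        i = rng.randrange(K)  # draw a state uniformly
        if rng.random() < epsilon:
            m = rng.randrange(K)  # explore
        else:
            m = max(range(K), key=lambda j: Q[i][j])  # exploit
        # Sender's quadratic-loss payoff with bias (assumed functional form).
        reward = -(actions[m] - states[i] - bias) ** 2
        Q[i][m] += alpha * (reward - Q[i][m])
    return Q, greedy_policy(Q)
```

Calling `run_q_learning(bias=0.0)` and `run_q_learning(bias=0.1)` and inspecting the returned policies gives a rough feel for the aligned and misaligned cases, although nothing in this sketch is calibrated to the paper's simulations.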

Paper Structure (15 sections, 8 theorems, 61 equations, 6 figures)

Key Result

Theorem 1

For all $\zeta>0$ there exist $\underline{\tau}, \varepsilon>0$ small enough such that, for all $\mu_0$, $\mathbf{L}(\mu_0) = \{\mu^*\}$ implies that there exists $\mu' \in \mathbf{NE}_{CS}$ with $\mu^* \in \mathbf{B}_{\zeta}(\mu')$ almost surely.

Figures (6)

  • Figure 1: Worst-case policies achieving $\underline{U}_K$ for $K\in \{15,19\}$, represented by a heatmap. The $x$-axis refers to states $x\in X_K$, the $y$-axis to messages $m\in M_K$, while color indicates probability mass. In (a), $K=15$ and the grid cannot fit the sequence of pools that always increase by $1$, so the construction repeats pool sizes. The worst case was computed by brute force according to the constraints given by SAPS, MSFR, and connectedness. In (b), where $K=19$, we have an example where the sequence fits exactly, and no computation is necessary. We implement the policy with the middle state ($8$ in (a), $10$ in (b)) fully revealing by mixing over messages, as is typically learnt by the algorithm as shown in Figure \ref{fig: nobias_pol}.
  • Figure 2: Normalized welfare of the final policy in simulation for differing initial policies $\mu_0$ given $K=21$. The computation finds all pure equilibria featuring connected pools, i.e. if states $x,x'$ send message $m$, so will any $x'' \in [x,x']$. Computing any more general set of equilibria is infeasible. (a) shows the welfare measure of final policies under babbling initial policies, while (b) shows those under fully revealing initial policies. $T=10^8$, $N=1000$. The minimum and maximum levels achieved are $.980$ and $.995$ under the babbling initialization, and $.985$ and $.997$ under the fully revealing initialization. Here, $\underline U_K = .980$.
  • Figure 3: Example of final policy learnt: we show the policy that achieves the median welfare level by normalized measure among our simulations. Here, (a) refers to the babbling initialization with median level of $0.986$, and (b) the fully revealing initialization with median level of $0.995$. $T=10^8$, $N=1000$.
  • Figure 4: Example of cyclical policy for two bias levels. We track the last 500 iterations whenever a policy changed by more than $0.2$ in at least one $(x,m)$ entry, and take an average. (a) refers to bias level $0.1$, (b) refers to a bias of $0.2$.
  • Figure 5: Payoff ratio $D_n = \frac{U^n-U^n_{CS}}{|U^n_{CS}|}$ for Sender and Receiver between payoff $U^n$ averaged over all final policies from simulation runs, and best PBE payoff $U^n_{CS}$, $n\in \{S,R\}$. The expected payoff is computed for the final policy of each simulation run, then averaged, given $N=1000$ simulations and $T=10^7$.
  • ...and 1 more figures
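
Two of the quantities in the captions above are simple enough to sketch in code: the connectedness condition from Figure 2 (if states $x,x'$ send message $m$, so does every state between them) and the payoff ratio $D_n$ from Figure 5. The helpers below are a hypothetical illustration, assuming a pure policy encoded as a list whose entry at position $x$ is the message index sent by the $x$-th state; `is_connected` and `payoff_ratio` are names introduced here, not taken from the paper.

```python
def is_connected(policy):
    """True if every message is sent by a contiguous block of (ordered) states.

    policy[x] is the message index sent by the x-th state (pure policy, assumed encoding).
    """
    last_seen = {}
    for x, m in enumerate(policy):
        # A message reused after a gap means its pool is not an interval of states.
        if m in last_seen and last_seen[m] != x - 1:
            return False
        last_seen[m] = x
    return True


def payoff_ratio(u, u_cs):
    """D_n = (U^n - U^n_CS) / |U^n_CS| as written in the Figure 5 caption."""
    return (u - u_cs) / abs(u_cs)
```

For example, `is_connected([0, 0, 1, 1, 2])` holds while `is_connected([0, 1, 0])` does not, and `payoff_ratio(-1.0, -2.0)` evaluates to `0.5`. The absolute value in the denominator means a positive ratio indicates an improvement over the benchmark even when payoffs are negative; the function raises `ZeroDivisionError` if the benchmark payoff is exactly zero.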

Theorems & Definitions (16)

  • Definition 1
  • Theorem 1
  • Definition 2: Connected policy
  • Definition 3: Middle state fully revealing, MSFR
  • Definition 4: Similar adjacent pool sizes, SAPS
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Proposition 1
  • proof
  • ...and 6 more