The Algorithmic Advantage: How Reinforcement Learning Generates Rich Communication
Emilio Calvano, Clemens Possnig, Juha Tolvanen
TL;DR
This paper embeds a tabular Q-learning sender into the Crawford–Sobel cheap-talk framework to analyze how reward-driven adaptation shapes long-run communication when the receiver is rational and best-responds. It provides a theoretical lower bound on informativeness in the no-bias case. With misaligned preferences, it shows that the learning dynamics cycle rather than converge to a fixed language, and that these cycles outperform any static equilibrium. The results reveal a fundamental trade-off: exploration and reward-driven updates promote informative communication and welfare gains, but they may prevent full revelation or convergence, especially under misalignment. For algorithmic advice on platforms, the findings point to resilience against uninformative outcomes as well as a risk of unstable or cyclical language when incentives are not aligned.
Abstract
We analyze strategic communication when advice is generated by a reinforcement-learning algorithm rather than by a fully rational sender. Building on the cheap-talk framework of Crawford and Sobel (1982), an advisor adapts its messages based on payoff feedback, while a decision maker best-responds. We provide a theoretical analysis of the long-run communication outcomes induced by such reward-driven adaptation. With aligned preferences, we establish that learning robustly leads to informative communication even from uninformative initial policies. With misaligned preferences, no stable outcome exists; instead, learning generates cycles that sustain highly informative communication and payoffs exceeding those of any static equilibrium.
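The interaction described above can be illustrated with a minimal sketch; it is not the paper's specification. It assumes a finite grid of equally likely states, quadratic losses with a sender bias parameter `bias`, a one-step (undiscounted) Q-update with epsilon-greedy exploration, and a receiver who plays the posterior mean implied by the sender's current greedy policy, using the prior mean for unused messages. All function names and parameter values are hypothetical.

```python
import random

def best_response(policy, states):
    """Receiver's action per message: mean of the states that send it.

    Unused (off-path) messages get the prior mean; this is an assumption.
    """
    prior_mean = sum(states) / len(states)
    actions = []
    for m in range(len(states)):
        pooled = [s for s, msg in zip(states, policy) if msg == m]
        actions.append(sum(pooled) / len(pooled) if pooled else prior_mean)
    return actions

def greedy(q_row):
    """Index of the highest Q-value (first index on ties)."""
    return max(range(len(q_row)), key=q_row.__getitem__)

def simulate(n_states=5, bias=0.0, alpha=0.1, epsilon=0.1, steps=20000, seed=0):
    rng = random.Random(seed)
    states = [i / (n_states - 1) for i in range(n_states)]
    # q[i][m]: sender's estimated payoff from message m in state i.
    # All zeros means every state sends message 0 (an uninformative start).
    q = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(steps):
        policy = [greedy(row) for row in q]
        actions = best_response(policy, states)
        i = rng.randrange(n_states)  # draw a state uniformly
        # Epsilon-greedy choice of message.
        m = rng.randrange(n_states) if rng.random() < epsilon else policy[i]
        # Sender's quadratic loss around its preferred action, state + bias.
        reward = -(actions[m] - states[i] - bias) ** 2
        q[i][m] += alpha * (reward - q[i][m])
    return q, [greedy(row) for row in q]
```

Calling `simulate(bias=0.0)` and `simulate(bias=0.1)` and comparing how many distinct messages appear in the returned greedy policy gives a rough sense of how informative the learned language is in this toy setting.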
