Strategizing against Q-learners: A Control-theoretical Approach

Yuksel Arslantas, Ege Yuceel, Muhammed O. Sayin

TL;DR

This work analyzes how strategically sophisticated opponents can manipulate naive independent Q-learners in repeated normal-form games through a control-theoretic lens. It models the interaction as a stochastic game (SG) with the continuum state given by the naive types' Q-function estimates, and establishes Lipschitz continuity of the value functions to enable accurate quantization-based approximations. A quantization mapping $\Phi$ reduces the continuum state to a finite SG, with explicit error bounds that scale with the approximation granularity and problem parameters, enabling minimax value iteration in two-player settings. Empirical results in Prisoner's Dilemma and related games show that strategic actors can increase their payoff by steering Q-learners, highlighting vulnerabilities and providing a foundation for designing more robust learning algorithms.
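
To see why Lipschitz continuity makes quantization workable, consider the following generic bound. It is only an illustration, not the paper's own statement: the symbols $L$ (a Lipschitz constant of a value function $V$) and $\Delta$ (the largest distance between any state $q$ and its quantized representative $\Phi(q)$) are notation assumed here.

$$
|V(q) - V(\Phi(q))| \;\le\; L\,\|q - \Phi(q)\| \;\le\; L\,\Delta
$$

Under these assumptions, refining the grid (a smaller $\Delta$) shrinks this part of the approximation error linearly. The error bounds summarized above additionally depend on the problem parameters.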

Abstract

In this paper, we explore the susceptibility of independent Q-learning algorithms (a classical and widely used multi-agent reinforcement learning method) to strategic manipulation by sophisticated opponents in repeatedly played normal-form games. We quantify how much strategically sophisticated agents can exploit naive Q-learners if they know the opponents' Q-learning algorithm. To this end, we formulate the strategic actors' interactions as a stochastic game (whose state encompasses the Q-function estimates of the Q-learners), treating the Q-learning algorithms as the underlying dynamical system. We also present a quantization-based approximation scheme to tackle the continuum state space and analyze its performance, both analytically and numerically, for two competing strategic actors and for a single strategic actor.
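
The single-strategic-actor case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm. It assumes a stateless Q-learner with softmax exploration and the update `q[a] += ALPHA * (r - q[a])`. The payoff numbers, step size, temperature, discount factor, grid size, and all function names are hypothetical. The Q-learner's estimate `q` is the state, a nearest-grid-point map plays the role of the quantization mapping, and value iteration runs on the resulting finite model.

```python
import itertools
import numpy as np

# Illustrative Prisoner's Dilemma payoffs (assumed, not taken from the paper).
# Action 0 = cooperate, 1 = defect.
# R_N[a, b]: Q-learner's reward when it plays a and the strategic actor plays b.
R_N = np.array([[3.0, 0.0], [5.0, 1.0]])
R_S = R_N.T  # strategic actor's reward in the same (a, b) indexing (symmetric game)

ALPHA = 0.5  # Q-learner step size (assumed)
TAU = 1.0    # softmax temperature (assumed)
GAMMA = 0.9  # strategic actor's discount factor (assumed)
K = 11       # grid points per coordinate of q

# Each coordinate of q stays in [min reward, max reward] under the assumed update.
grid = np.linspace(R_N.min(), R_N.max(), K)

def quantize(q):
    """Map a continuous Q-vector to the indices of the nearest grid point."""
    return tuple(int(np.abs(grid - x).argmin()) for x in q)

def softmax(q):
    """Action probabilities of the Q-learner under softmax exploration."""
    z = np.exp((q - q.max()) / TAU)
    return z / z.sum()

def value_iteration(n_iters=100):
    """Value iteration for the strategic actor on the quantized state space."""
    V = np.zeros((K, K))
    policy = np.zeros((K, K), dtype=int)
    for _ in range(n_iters):
        V_new = np.empty_like(V)
        for i, j in itertools.product(range(K), repeat=2):
            q = np.array([grid[i], grid[j]])
            p = softmax(q)
            action_values = np.zeros(2)
            for b in range(2):          # strategic actor's action
                for a in range(2):      # Q-learner's sampled action
                    q_next = q.copy()
                    q_next[a] += ALPHA * (R_N[a, b] - q[a])
                    action_values[b] += p[a] * (
                        R_S[a, b] + GAMMA * V[quantize(q_next)]
                    )
            V_new[i, j] = action_values.max()
            policy[i, j] = action_values.argmax()
        V = V_new
    return V, policy

if __name__ == "__main__":
    V, policy = value_iteration()
    print("Value at q = (0, 0):", V[quantize(np.array([0.0, 0.0]))])
```

With two competing strategic actors, the `max` over `b` would be replaced by the value of a matrix game at each quantized state (minimax value iteration). Note that a coarse grid can round small Q-updates back to the same grid point, which is one reason the approximation error depends on the granularity.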

Paper Structure (8 sections, 40 equations, 5 figures, 2 tables, 2 algorithms)

Figures (5)

  • Figure 1: Illustration of the interaction between N-types and A-types across the repeated play of the normal-form game $\mathcal{G}$. A-types compete against each other while treating the N-types' Q-function estimates as the underlying state of an SG. A-types can use dynamic programming to solve the SG and use trackers to follow the N-types' Q-function estimates by leveraging complete model knowledge.
  • Figure 2: The evolution of the empirical averages of the action profiles for Q-learner vs Q-learner in the prisoner's dilemma.
  • Figure 3: The evolution of the empirical averages of the action profiles for strategic actor vs Q-learner in the prisoner's dilemma.
  • Figure 4: The evolution of the empirical averages of the action profiles for Q-learner vs Q-learner in the two-agent four-action zero-sum game.
  • Figure 5: The evolution of the empirical averages of the action profiles for strategic actor vs Q-learner in the two-agent four-action zero-sum game.
