Stackelberg Coupling of Online Representation Learning and Reinforcement Learning

Fernando Martinez, Tao Li, Yingdong Lu, Juntao Chen

TL;DR

This paper tackles instability in off-policy value-based RL caused by tight coupling between representation learning and value targets. It introduces SCORER, a Stackelberg-based framework in which the Q-network acts as a slow-moving leader and the perception encoder as a fast-adapting follower, with updates governed by two-timescale stochastic approximation. The follower minimizes Bellman error variance to stabilize representation learning, while the leader minimizes the mean squared Bellman error (MSBE) given the follower's best response, enabling stable co-adaptation without extra supervision. Experiments with DQN variants and PQN on MinAtar and MiniGrid show improved sample efficiency and final performance, supporting the practical value of hierarchical coupling in deep RL. The work provides a general, lightweight approach to stabilizing representation learning in value-based RL and outlines future extensions to continuous control and sample-complexity analysis.
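
The leader-follower structure summarized above can be written as a bi-level problem. The block below is a sketch in standard notation, not the paper's exact formulation: $f_\phi$ and $Q_\theta$ follow the Figure 1 caption, while the target parameters $\bar\theta, \bar\phi$, the step sizes $\alpha_k, \beta_k$, and the stochastic gradient estimates $\widehat{\nabla}$ are assumptions introduced here for illustration.

```latex
% Bellman error on a transition (s, a, r, s').
% \bar\theta, \bar\phi: target-network parameters (assumed, not from the source).
\begin{align}
\delta_{\theta,\phi}
  &= r + \gamma \max_{a'} Q_{\bar\theta}\bigl(f_{\bar\phi}(s'), a'\bigr)
       - Q_\theta\bigl(f_\phi(s), a\bigr) \\
% Leader (Q-network): minimize MSBE at the follower's best response.
\min_{\theta}\;
  & \mathbb{E}\bigl[\delta_{\theta,\phi^*(\theta)}^{2}\bigr]
  \quad \text{s.t.} \quad
  \phi^*(\theta) \in \arg\min_{\phi}\; \operatorname{Var}\bigl[\delta_{\theta,\phi}\bigr] \\
% Two-timescale approximation: fast follower, slow leader.
\phi_{k+1} &= \phi_k - \beta_k\, \widehat{\nabla}_{\phi} \operatorname{Var}\bigl[\delta_{\theta_k,\phi_k}\bigr] \\
\theta_{k+1} &= \theta_k - \alpha_k\, \widehat{\nabla}_{\theta}\, \mathbb{E}\bigl[\delta_{\theta_k,\phi_{k+1}}^{2}\bigr],
  \qquad \frac{\alpha_k}{\beta_k} \to 0
\end{align}
```

The condition $\alpha_k / \beta_k \to 0$ is the usual two-timescale requirement: from the follower's point of view the leader is nearly stationary, so $\phi_k$ can track $\phi^*(\theta_k)$. Updating the leader less frequently, as the abstract describes, is a practical way to obtain a similar separation of timescales.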

Abstract

Deep Q-learning jointly learns representations and values within monolithic networks, promising beneficial co-adaptation between features and value estimates. Although this architecture has attained substantial success, the coupling between representation and value learning creates instability as representations must constantly adapt to non-stationary value targets, while value estimates depend on these shifting representations. This is compounded by high variance in bootstrapped targets, which causes bias in value estimation in off-policy methods. We introduce Stackelberg Coupled Representation and Reinforcement Learning (SCORER), a framework for value-based RL that views representation and Q-learning as two strategic agents in a hierarchical game. SCORER models the Q-function as the leader, which commits to its strategy by updating less frequently, while the perception network (encoder) acts as the follower, adapting more frequently to learn representations that minimize Bellman error variance given the leader's committed strategy. Through this division of labor, the Q-function minimizes MSBE while perception minimizes its variance, thereby reducing bias accordingly, with asymmetric updates allowing stable co-adaptation, unlike simultaneous parameter updates in monolithic solutions. Our proposed SCORER framework leads to a bi-level optimization problem whose solution is approximated by a two-timescale algorithm that creates an asymmetric learning dynamic between the two players. Extensive experiments on DQN and its variants demonstrate that gains stem from algorithmic insight rather than model complexity.
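
To make the asymmetric update pattern concrete, here is a minimal NumPy sketch under strong simplifying assumptions: a linear encoder `W`, a linear Q-head `theta` with one weight vector per action, hand-derived gradients, and TD targets computed from the current parameters and then held fixed (no separate target network). The function names, learning rates, and the `leader_every` schedule are hypothetical and are not taken from the paper.

```python
import numpy as np

def td_targets(W, theta, batch, gamma):
    """Bootstrapped targets r + gamma * max_a' Q(s', a'), treated as constants."""
    z_next = batch["s_next"] @ W.T               # (N, d_z) next-state features
    q_next = z_next @ theta.T                    # (N, n_actions)
    return batch["r"] + gamma * (1.0 - batch["done"]) * q_next.max(axis=1)

def bellman_errors(W, theta, batch, targets):
    """delta_i = target_i - Q(s_i, a_i) with Q(s, a) = theta[a] . (W s)."""
    z = batch["s"] @ W.T                         # (N, d_z)
    q_sa = np.einsum("nd,nd->n", z, theta[batch["a"]])
    return targets - q_sa

def follower_grad(W, theta, batch, targets):
    """Gradient of the batch variance of the Bellman error w.r.t. the encoder W."""
    delta = bellman_errors(W, theta, batch, targets)
    centered = delta - delta.mean()
    # d Var / d W[d, s] = -(2/N) * sum_i centered_i * theta[a_i, d] * s_i[s]
    return -2.0 / delta.size * np.einsum(
        "n,nd,ns->ds", centered, theta[batch["a"]], batch["s"]
    )

def leader_grad(W, theta, batch, targets):
    """Gradient of the batch MSBE w.r.t. the Q-head weights theta."""
    delta = bellman_errors(W, theta, batch, targets)
    z = batch["s"] @ W.T
    grad = np.zeros_like(theta)
    # Accumulate -(2/N) * delta_i * z_i into the row of the action taken.
    np.add.at(grad, batch["a"], -2.0 / delta.size * delta[:, None] * z)
    return grad

def two_timescale_updates(W, theta, sample_batch, steps, gamma=0.99,
                          lr_follower=1e-2, lr_leader=1e-3, leader_every=4):
    """Follower (encoder) steps every iteration; leader (Q-head) steps rarely."""
    for k in range(steps):
        batch = sample_batch()
        targets = td_targets(W, theta, batch, gamma)
        # Fast timescale: encoder reduces Bellman error variance for a fixed leader.
        W = W - lr_follower * follower_grad(W, theta, batch, targets)
        if (k + 1) % leader_every == 0:
            # Slow timescale: Q-head reduces MSBE given the adapted encoder.
            targets = td_targets(W, theta, batch, gamma)
            theta = theta - lr_leader * leader_grad(W, theta, batch, targets)
    return W, theta
```

Here `sample_batch` is any zero-argument callable returning a dict with arrays `s`, `a` (integer actions), `r`, `s_next`, and `done` (0/1 floats). The sketch separates the two objectives and the two update rates; it does not reproduce the paper's networks, replay buffer, or target-network handling.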

Paper Structure

This paper contains 51 sections, 2 theorems, 40 equations, 8 figures, 16 tables, 1 algorithm.

Key Result

Lemma J.1

Figures (8)

  • Figure 1: SCORER framework. (Left) Overall agent-environment interaction loop. Internally, the agent comprises a perception network (Follower, $f_\phi$) and a control network (Leader, $Q_\theta$) that interact via Stackelberg game dynamics ($U_F, U_L$ representing their utility functions). The perception network produces features $z=f_\phi(s)$ used by the control network. (Right) Details the Stackelberg interaction within the agent.
  • Figure 2: Learning curves on MinAtar environments comparing SCORER variants against baselines. SCORER variants generally demonstrate improved sample efficiency and performance across several algorithm-environment combinations.
  • Figure 3: Time-to-threshold analysis showing mean timesteps (in millions) $\pm$ 95% confidence interval to reach 99% maximum performance over 30 seeds. SR (%) is the success rate of runs reaching the threshold.
  • Figure 4: Ablation studies on SCORER's core components on MinAtar. (Left) Follower's Objective: Bellman Error (BE) Variance is superior to MSBE. (Center) Stackelberg Roles: The standard SCORER hierarchy outperforms the inverted role configuration. (Right) Coupling Dynamic: SCORER's hierarchical coupling is critical for performance, substantially outperforming a synchronous baseline on Breakout.
  • Figure 5: Learning curves on classic control environments (CartPole-v1, Acrobot-v1). Each row corresponds to a base algorithm (DQN, DDQN, DuelingDQN, DuelingDDQN), and each column to an environment. Curves show IQM return over 30 seeds; shaded regions represent 95% confidence intervals.
  • ...and 3 more figures
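
The captions of Figures 3 and 5 refer to two aggregate metrics: the interquartile mean (IQM) across seeds and the time needed to reach a fraction of maximum performance, together with a success rate. The sketch below shows one plausible way to compute them with NumPy. The function names, the truncation rule for the trimmed fraction, and the `max_perf` argument (the reference "maximum performance", whose exact definition the captions do not give) are assumptions made for illustration.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores.

    Assumption: the number of values dropped at each end is int(0.25 * n),
    i.e. a non-integer cut is rounded down.
    """
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    k = int(0.25 * x.size)
    return x[k:x.size - k].mean()

def time_to_threshold(timesteps, curves, max_perf, frac=0.99):
    """First timestep at which each run reaches frac * max_perf.

    timesteps: (n_evals,) environment steps at each evaluation point.
    curves:    (n_seeds, n_evals) returns, one row per seed.
    max_perf:  reference maximum performance (assumed non-negative here).
    Returns (times, success_rate): times has NaN for runs that never reach
    the threshold; success_rate is the percentage of runs that do.
    """
    timesteps = np.asarray(timesteps, dtype=float)
    curves = np.asarray(curves, dtype=float)
    reached = curves >= frac * max_perf          # (n_seeds, n_evals) booleans
    hit = reached.any(axis=1)                    # did the run ever get there?
    first = reached.argmax(axis=1)               # index of first True (0 if none)
    times = np.where(hit, timesteps[first], np.nan)
    return times, 100.0 * hit.mean()
```

A mean time-to-threshold over successful runs can then be taken with `np.nanmean(times)`; how failed runs are treated when forming the confidence interval is a reporting choice that the captions do not specify.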

Theorems & Definitions (2)

  • Lemma J.1: Lemma 2.2 from Ghadimi and Wang (2018), "Approximation Methods for Bilevel Programming"
  • Theorem J.1