Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
Fernando Martinez, Tao Li, Yingdong Lu, Juntao Chen
TL;DR
This paper tackles instability in off-policy value-based RL caused by the tight coupling between representation learning and value targets. It introduces SCORER, a Stackelberg-based framework in which the Q-network acts as a slow-moving leader and the perception encoder as a fast-adapting follower, with updates governed by two-timescale stochastic approximation. The follower minimizes the variance of the Bellman error to stabilize representation learning, while the leader minimizes the mean squared Bellman error (MSBE) given the follower's best response, enabling stable co-adaptation without extra supervision. Empirical results for DQN variants and PQN on the MinAtar and MiniGrid benchmarks show improved sample efficiency and final performance, supporting the practical value of hierarchical coupling in deep RL. The work provides a general, lightweight approach to stabilizing representation learning in value-based RL and outlines future extensions to continuous control and to theoretical sample-complexity analysis.
Abstract
Deep Q-learning jointly learns representations and values within monolithic networks, promising beneficial co-adaptation between features and value estimates. Although this architecture has attained substantial success, the coupling between representation and value learning creates instability: representations must constantly adapt to non-stationary value targets, while value estimates depend on these shifting representations. This is compounded by high variance in bootstrapped targets, which causes bias in value estimation in off-policy methods. We introduce Stackelberg Coupled Representation and Reinforcement Learning (SCORER), a framework for value-based RL that views representation learning and Q-learning as two strategic agents in a hierarchical game. SCORER models the Q-function as the leader, which commits to its strategy by updating less frequently, while the perception network (encoder) acts as the follower, adapting more frequently to learn representations that minimize Bellman error variance given the leader's committed strategy. Through this division of labor, the Q-function minimizes the mean squared Bellman error (MSBE) while the perception network minimizes its variance, which in turn reduces bias. The asymmetric updates allow stable co-adaptation, unlike the simultaneous parameter updates of monolithic solutions. The SCORER framework leads to a bi-level optimization problem whose solution is approximated by a two-timescale algorithm that creates an asymmetric learning dynamic between the two players. Extensive experiments on DQN and its variants demonstrate that the gains stem from algorithmic insight rather than model complexity.
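The leader/follower split described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a linear encoder `W` (follower) and a linear Q-head `theta` (leader), hand-derived semi-gradients with the bootstrapped target held constant, and hypothetical names and hyperparameters (`scorer_style_step`, `lr_follower`, `lr_leader`, `leader_every`) chosen only for illustration.

```python
import numpy as np

def td_errors(W, theta, S, A, R, S2, gamma=0.99):
    """TD errors; the bootstrapped target is treated as a constant (semi-gradient)."""
    phi = S @ W.T                               # (B, d_f) encoder features
    q_sa = np.sum(phi * theta[A], axis=1)       # Q(s, a) for the actions taken
    q_next = (S2 @ W.T) @ theta.T               # (B, n_a) next-state Q-values
    target = R + gamma * q_next.max(axis=1)
    return target - q_sa, phi

def follower_grad(W, theta, S, A, R, S2, gamma=0.99):
    """Gradient of the batch variance of the TD error w.r.t. the encoder W."""
    delta, _ = td_errors(W, theta, S, A, R, S2, gamma)
    centered = delta - delta.mean()
    # d Var / d q_i = -(2/B) * centered_i, and d q_i / d W = outer(theta[a_i], s_i)
    return (-2.0 / len(delta)) * (centered[:, None] * theta[A]).T @ S

def leader_grad(W, theta, S, A, R, S2, gamma=0.99):
    """Gradient of the mean squared TD error (MSBE estimate) w.r.t. the Q-head theta."""
    delta, phi = td_errors(W, theta, S, A, R, S2, gamma)
    grad = np.zeros_like(theta)
    # Accumulate per-sample gradients into the row of the action taken.
    np.add.at(grad, A, (-2.0 / len(delta)) * delta[:, None] * phi)
    return grad

def scorer_style_step(W, theta, batch, step,
                      lr_follower=1e-2, lr_leader=1e-3, leader_every=4):
    """One asymmetric update: follower every step, leader only every few steps."""
    W = W - lr_follower * follower_grad(W, theta, *batch)
    if step % leader_every == 0:
        # The leader updates against the follower's freshly adapted encoder.
        theta = theta - lr_leader * leader_grad(W, theta, *batch)
    return W, theta
```

Here `batch` is a tuple `(S, A, R, S2)` of states, integer actions, rewards, and next states. The larger follower step size and the less frequent leader update are one simple way to realize a two-timescale dynamic. The actual schedules, losses, and network architectures used by SCORER may differ.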
