XQC: Well-conditioned Optimization Accelerates Deep Reinforcement Learning

Daniel Palenicek, Florian Vogt, Joe Watson, Ingmar Posner, Jan Peters

Abstract

Sample efficiency is a central property of effective deep reinforcement learning algorithms. Recent work has improved this through added complexity, such as larger models, exotic network architectures, and more complex algorithms, which are typically motivated purely by empirical performance. We take a more principled approach by focusing on the optimization landscape of the critic network. Using the eigenspectrum and condition number of the critic's Hessian, we systematically investigate the impact of common architectural design decisions on training dynamics. Our analysis reveals that a novel combination of batch normalization (BN), weight normalization (WN), and a distributional cross-entropy (CE) loss produces condition numbers orders of magnitude smaller than baselines. This combination also naturally bounds gradient norms, a property critical for maintaining a stable effective learning rate under non-stationary targets and bootstrapping. Based on these insights, we introduce XQC: a well-motivated, sample-efficient deep actor-critic algorithm built upon soft actor-critic that embodies these optimization-aware principles. We achieve state-of-the-art sample efficiency across 55 proprioception and 15 vision-based continuous control tasks, all while using significantly fewer parameters than competing methods. Our code is available at danielpalenicek.github.io/projects/xqc.

Paper Structure

This paper contains 34 sections, 9 theorems, 16 equations, 26 figures, 1 table.

Key Result

Proposition 1

The loss $l({\bm{y}}, \hat{{\bm{y}}}) = \frac{1}{2}||{\bm{y}} -\hat{{\bm{y}}}||_2^2$ has unbounded gradients w.r.t. $\hat{{\bm{y}}}$,
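
The gradient behind Proposition 1 can be written out in one step. This is a short worked derivation from the loss definition above, not the paper's own proof:

```latex
\nabla_{\hat{\bm{y}}}\, l({\bm{y}}, \hat{{\bm{y}}})
  = \nabla_{\hat{\bm{y}}}\, \tfrac{1}{2}\,\|{\bm{y}} - \hat{{\bm{y}}}\|_2^2
  = \hat{{\bm{y}}} - {\bm{y}},
\qquad
\big\|\nabla_{\hat{\bm{y}}}\, l\big\|_2 = \|\hat{{\bm{y}}} - {\bm{y}}\|_2 .
```

The gradient norm equals the prediction error, so it grows without bound as $\hat{{\bm{y}}}$ moves away from ${\bm{y}}$.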

Figures (26)

  • Figure 1: Well-conditioned network architectures yield state-of-the-art RL performance. Our algorithm, XQC, with a BN and WN-based architecture and a CE loss, achieves competitive performance against state-of-the-art baselines across 55 proprioceptive continuous control tasks from four different benchmarks with a single set of hyperparameters. Notably, it does so with $\sim\!4.5\times$ fewer parameters and $\sim\!5\times$ less compute in terms of flop/s than simba-v2, the closest competitor. XQC's efficiency carries over to RL from pixels on 15 vision-based DMC tasks, significantly improving on drq-v2.
  • Figure 2: When performing gradient-based optimization, the condition number ($\kappa$) of the objective's Hessian significantly impacts convergence. We illustrate this phenomenon with a simple two-dimensional quadratic example. As $\kappa$ increases by an order of magnitude, gradient descent converges at a lower rate. We believe this phenomenon plays a similar role when learning the critic in deep reinforcement learning, where high condition numbers lead to poor sample efficiency.
  • Figure 3: Eigenvalues and condition numbers on dog-trot over 5 seeds for different critic architectures during training. The top and middle rows show the eigenspectra of the CE loss and MSE loss, respectively. The columns correspond to different combinations of normalization layers and WN. The bottom row shows the IQM and 90% SBCI of the condition number $\kappa$ aggregated over five seeds for the CE and MSE losses. Architectures using BN show more compact and stable eigenspectra over the course of training with no outliers. LN suffers from large outlier modes and has larger eigenvalues overall. Similarly, the CE loss significantly improves loss landscape conditioning over the MSE loss.
  • Figure 4: The condition numbers and maximum eigenvalues against the return at 1M steps on DMC dog-trot. Normalization strategies are color-coded (BN, LN, Dense). Use of WN is shown by an empty shape, whereas no WN is shown by a filled shape. The MSE and CE losses are distinguished by marker symbol. Architectures with lower condition numbers and lower maximum eigenvalues tend to have better final returns. Also, BN, WN, and the categorical CE loss each improve the loss conditioning independently (columns 1-3). Combined, they result in the best conditioning and best performance. For reference, we include simba-v2, a strong baseline with a similarly low condition number.
  • Figure 5: The XQC network architecture consists of only three standard components: Linear, BN, and ReLU for a total of 4 hidden layers.
  • ...and 21 more figures
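
The effect described in the Figure 2 caption can be reproduced with a minimal sketch. It runs gradient descent on a two-dimensional quadratic whose Hessian is `diag(1, kappa)` and counts the steps to convergence. The function name `gd_steps_to_converge`, the step size `1 / lambda_max`, the starting point, and the tolerance are illustrative assumptions, not the settings used in the paper's figure.

```python
import numpy as np


def gd_steps_to_converge(kappa, tol=1e-6, max_steps=100_000):
    """Count gradient-descent steps on f(x) = 0.5 * x^T H x, H = diag(1, kappa).

    The Hessian eigenvalues are 1 and kappa, so kappa is the condition number.
    """
    eigvals = np.array([1.0, kappa])
    lr = 1.0 / eigvals.max()  # assumed step size 1 / lambda_max (stable for any kappa)
    x = np.array([1.0, 1.0])  # assumed starting point
    for step in range(max_steps):
        if np.linalg.norm(x) < tol:
            return step
        x = x - lr * eigvals * x  # gradient of f at x is H x
    return max_steps


# Increase kappa by an order of magnitude each time and compare step counts.
steps = {kappa: gd_steps_to_converge(kappa) for kappa in (1.0, 10.0, 100.0)}
for kappa, n in steps.items():
    print(f"kappa={kappa:>6.1f}  steps={n}")
```

With this step size the low-curvature direction contracts by a factor of `1 - 1/kappa` per step, so the number of steps grows roughly linearly with `kappa`.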

Theorems & Definitions (21)

  • Definition 1
  • Definition 2
  • Definition 3
  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Proposition 3
  • Proposition 4
  • Lemma 1
  • proof
  • ...and 11 more