Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to $Q$-Learning

Ankur Naskar, Gugan Thoppe, Vijay Gupta

Abstract

Algorithms for solving nonlinear fixed-point equations -- such as average-reward $Q$-learning and TD-learning -- often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak--Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for $Q$-learning in both average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.
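The non-monotonicity mentioned in the abstract can be seen on a two-dimensional example. The sketch below assumes the semi-norm is the span semi-norm, $\max_i x_i - \min_i x_i$, a common choice in average-reward analyses; the abstract itself does not name a specific semi-norm, and the vectors `x` and `y` are purely illustrative.

```python
def span(x):
    """Span semi-norm: max(x) - min(x). It vanishes on constant vectors."""
    return max(x) - min(x)


def sup_norm(x):
    """Sup norm, used here as an example of a monotone norm."""
    return max(abs(v) for v in x)


# Illustrative vectors (assumed, not from the paper).
x = [1.0, 0.0]
y = [1.0, 1.0]

# Check that |x_i| <= |y_i| holds in every coordinate.
dominated = all(abs(a) <= abs(b) for a, b in zip(x, y))

# A monotone function would then satisfy f(x) <= f(y).
# The sup norm does; the span semi-norm does not.
print("coordinatewise dominated:", dominated)
print("span(x), span(y):", span(x), span(y))
print("sup_norm(x), sup_norm(y):", sup_norm(x), sup_norm(y))
```

Here `y` dominates `x` coordinatewise, yet `span(y)` is zero because `y` is a constant vector, while `span(x)` is positive.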

Paper Structure

This paper contains 24 sections, 14 theorems, 126 equations, 1 figure, 2 tables.

Key Result

Lemma 3.1

(chen2025non) For any semi-norm $\upsilon: \mathbb{R}^d \to \mathbb{R}$, one can define an induced norm $\|\cdot\|$ such that, for all $Q \in \mathbb{R}^d,$ where $E$ is as defined in e:norm.vanishing.subspace.

Figures (1)

  • Figure 1: Performance of distributed $Q$-learning with $N$ agents. Top row: parameter-dependent (left) versus parameter-free (our work) (right) versions of synchronous distributed average-reward $Q$-learning. Bottom row: parameter-dependent and parameter-free versions of asynchronous distributed exponentially-discounted $Q$-learning. Each curve shows the error averaged over $50$ independent runs of the algorithm, all initialized identically with $Q_0=\bar{Q}_0=0$. The parameter-dependent versions use the linear stepsize $\alpha_t=c_1/t+c_2$, where $c_1=4/(1-\beta)$ and $c_2=\max\left\{\frac{2.88\log(|\mathcal{S}| |\mathcal{A}|)}{(1-\beta)^2},3\right\}$, with $\beta=\gamma=0.1$ in the discounted setting and $\beta=0.75$ in the average-reward setting. The parameter-free stepsize is chosen as $\alpha_t=(t+1)^{-0.75}$. The error is measured with respect to the fixed point $Q^*_1$ of a fixed MDP. The plots show that the iterates $(\bar{Q}_T)$ converge to $Q^*_1$ at the desired rate of $\tilde{O}(1/\sqrt{T})$. Moreover, performance improves as $N$ increases. Most importantly, our parameter-free algorithms (right) perform comparably to the parameter-dependent algorithms (left).
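The parameter-free setup in the caption (stepsize $\alpha_t=(t+1)^{-0.75}$, initialization $Q_0=\bar{Q}_0=0$, Polyak--Ruppert averaging) can be sketched for the single-agent, synchronous, discounted case as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2-state, 2-action MDP (`R`, `P`), the helper names, and the exact indexing of the running average are all hypothetical, and only $\gamma=0.1$ and the stepsize are taken from the caption.

```python
import random

GAMMA = 0.1  # discount factor, as in the caption's discounted setting

# Hypothetical MDP (not from the paper): R[s][a] and P[s][a][s'].
R = [[1.0, 0.0], [0.5, 0.2]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]


def bellman(Q):
    """Exact Bellman optimality operator for the MDP above."""
    return [[R[s][a] + GAMMA * sum(P[s][a][n] * max(Q[n]) for n in range(2))
             for a in range(2)] for s in range(2)]


def q_star(iters=100):
    """Reference fixed point, computed by iterating the exact operator."""
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        Q = bellman(Q)
    return Q


def averaged_q_learning(T, seed=0):
    """Synchronous Q-learning with a running average of the iterates."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    Q_bar = [[0.0, 0.0], [0.0, 0.0]]
    for t in range(T):
        alpha = (t + 1) ** -0.75  # stepsize needs no problem constants
        V = [max(Q[s]) for s in range(2)]  # greedy values of the current iterate
        for s in range(2):
            for a in range(2):
                # One simulator sample of the next state for each (s, a).
                nxt = 0 if rng.random() < P[s][a][0] else 1
                target = R[s][a] + GAMMA * V[nxt]
                Q[s][a] += alpha * (target - Q[s][a])
        # Running mean of the updated iterates (assumed indexing).
        for s in range(2):
            for a in range(2):
                Q_bar[s][a] += (Q[s][a] - Q_bar[s][a]) / (t + 1)
    return Q, Q_bar


def sup_error(A, B):
    """Largest entrywise gap between two Q-tables."""
    return max(abs(A[s][a] - B[s][a]) for s in range(2) for a in range(2))


Q_last, Q_avg = averaged_q_learning(T=5000)
print("sup-norm error of averaged iterate:", sup_error(Q_avg, q_star()))
```

The stepsize uses neither the discount factor nor the size of the state-action space, which is the sense in which the schedule is parameter-free. The distributed and asynchronous variants shown in the figure are not covered by this sketch.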

Theorems & Definitions (37)

  • Lemma 3.1
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Theorem 3.5
  • Remark 3.6: Optimal Convergence Rate
  • Remark 3.7: Linear Speedup
  • Remark 3.8: Parameter-Free Stepsize
  • Theorem 3.9: Synchronous Average-Reward Q-learning
  • Remark 3.10
  • ...and 27 more