Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach

Narim Jeong; Donghwan Lee

TL;DR

This work develops a unified finite-time error analysis for two soft Q-learning variants, one based on the log-sum-exp (LSE) operator and one on the Boltzmann operator, by modeling their updates as switching nonlinear systems. The authors construct lower and upper comparison systems that bound the true dynamics and derive non-asymptotic error bounds, with constant and decaying terms, for both operators under mild assumptions. The results reveal how the step size $\alpha$, the sharpness parameter $\beta$, and problem parameters influence convergence, and empirical simulations on a small MDP corroborate the theory. Overall, the switching-system framework provides tractable, non-asymptotic convergence guarantees for entropy-regularized RL algorithms and suggests a path toward finite-time analysis of other reinforcement learning methods.

Abstract

Soft Q-learning is a variation of Q-learning designed to solve entropy regularized Markov decision problems where an agent aims to maximize the entropy regularized value function. Despite its empirical success, there have been limited theoretical studies of soft Q-learning to date. This paper aims to offer a novel and unified finite-time, control-theoretic analysis of soft Q-learning algorithms. We focus on two types of soft Q-learning algorithms: one utilizing the log-sum-exp operator and the other employing the Boltzmann operator. By using dynamical switching system models, we derive novel finite-time error bounds for both soft Q-learning algorithms. We hope that our analysis will deepen the current understanding of soft Q-learning by establishing connections with switching system models and may even pave the way for new frameworks in the finite-time analysis of other reinforcement learning algorithms.
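The two operators named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions with sharpness parameter $\beta > 0$: the log-sum-exp operator $\frac{1}{\beta}\log\sum_a e^{\beta q_a}$ and the Boltzmann operator $\sum_a q_a e^{\beta q_a} / \sum_a e^{\beta q_a}$. The paper's exact notation may differ, and the function names `lse` and `boltzmann` and the sample values are hypothetical.

```python
import math


def lse(q_values, beta):
    """Log-sum-exp operator: (1/beta) * log(sum_a exp(beta * q_a))."""
    m = max(q_values)  # subtract the max before exponentiating to avoid overflow
    return m + math.log(sum(math.exp(beta * (q - m)) for q in q_values)) / beta


def boltzmann(q_values, beta):
    """Boltzmann operator: softmax-weighted average of the action values."""
    m = max(q_values)
    weights = [math.exp(beta * (q - m)) for q in q_values]
    return sum(w * q for w, q in zip(weights, q_values)) / sum(weights)


# Illustrative action values for a single state (not taken from the paper).
q = [0.5, 1.0, -0.5]
for beta in (1.0, 10.0, 1000.0):
    print(f"beta={beta:g}  lse={lse(q, beta):.6f}  boltzmann={boltzmann(q, beta):.6f}")
```

Under these definitions the LSE value lies between $\max_a q_a$ and $\max_a q_a + \log|\mathcal{A}|/\beta$, while the Boltzmann value lies between the mean and the maximum of the action values, so both approach the hard maximum used by standard Q-learning as $\beta$ grows.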

Paper Structure (27 sections, 118 equations, 3 figures, 1 algorithm)

INTRODUCTION
PRELIMINARIES
  Markov Decision Problem
  Switching System
  Soft Q-learning
  Assumptions and Definitions
SWITCHING SYSTEM FOR Q-LEARNING
FINITE-TIME ANALYSIS OF LSE SOFT Q-LEARNING
  Nonlinear System Representation of LSE Soft Q-learning
  Lower Comparison System of LSE Soft Q-learning
  Upper Comparison System of LSE Soft Q-learning
  Error System for LSE Soft Q-learning System Analysis
FINITE-TIME ANALYSIS OF BOLTZMANN SOFT Q-LEARNING
  Nonlinear System Representation of Boltzmann Soft Q-learning
  Lower Comparison System of Boltzmann Soft Q-learning
...and 12 more sections

Figures (3)

Figure 1: We consider an MDP with $\mathcal{S}=\{1,2\}$, $\mathcal{A}=\{1,2\}$, per-action state transition matrices $p_1=[[0.5, 0.5], [0.9, 0.1]]$ and $p_2=[[0.6, 0.4], [0.3, 0.7]]$, and reward function $r(1,1,1)=0.5$, $r(1,1,2)=1$, $r(1,2,2)=r(2,1,2)=r(2,2,1)=-0.5$, and $0$ otherwise.
Figure 2: Impact of $\beta$ on $\mathbb{E}[\|Q_{\infty}-Q^*\|_\infty]$ and the finite-time error bounds when $\alpha=0.001$
Figure 3: Impact of $\alpha$ on $\mathbb{E}[\|Q_{\infty}-Q^*\|_\infty]$ and the finite-time error bounds when $\beta=1000$

Theorems & Definitions (17)

