Data-Efficient Quadratic Q-Learning Using LMIs

J. S. van Hulst, W. P. M. H. Heemels, D. J. Antunes

TL;DR

The paper tackles data inefficiency in off-policy Q-learning by introducing two methods, LMI-QL and LMI-QLi, that learn a Q-function that is linear in its parameters and quadratic in basis functions of the state and in the control deviation from a base policy. Both methods minimize the $\ell_1$-norm of the Bellman residuals through convex optimization: LMI-QL solves a semidefinite program (SDP) with linear matrix inequalities (LMIs) obtained from a convex relaxation, and LMI-QLi solves a sequence of SDPs. Parameterizing the Q-function around a base policy and a state basis $\phi(x)$ makes direct learning of the optimal Q-function tractable while keeping every subproblem convex. A nonlinear pendulum case study compares the two methods with LSPI and with a model-based controller (LQR with feedback linearization) as a function of the number of data points used, and reports better data efficiency than LSPI together with competitive performance.

Abstract

Reinforcement learning (RL) has seen significant research and application results but often requires large amounts of training data. This paper proposes two data-efficient off-policy RL methods that use parametrized Q-learning. In these methods, the Q-function is chosen to be linear in the parameters and quadratic in selected basis functions in the state and control deviations from a base policy. A cost penalizing the $\ell_1$-norm of Bellman errors is minimized. We propose two methods: Linear Matrix Inequality Q-Learning (LMI-QL) and its iterative variant (LMI-QLi), which solve the resulting episodic optimization problem through convex optimization. LMI-QL relies on a convex relaxation that yields a semidefinite programming (SDP) problem with linear matrix inequalities (LMIs). LMI-QLi entails solving sequential iterations of an SDP problem. Both methods combine convex optimization with direct Q-function learning, significantly improving learning speed. A numerical case study demonstrates their advantages over existing parametrized Q-learning methods.
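
The objective described above can be illustrated with a minimal sketch. It is not the authors' LMI formulation and does not solve any SDP; it only evaluates the $\ell_1$ Bellman cost for a given parameter matrix. The assumptions are ours: a scalar control, a cost-minimizing convention, a symmetric matrix `P` with positive control block, and the hypothetical names `q_value`, `min_q_value` and `l1_bellman_cost`.

```python
from typing import Callable, List, Sequence, Tuple

# One off-policy sample: (state, scalar control, stage cost, next state).
Transition = Tuple[Sequence[float], float, float, Sequence[float]]
Matrix = List[List[float]]


def q_value(P: Matrix, phi_x: Sequence[float], du: float) -> float:
    """Quadratic form z^T P z with z = [phi(x); u - base_policy(x)].

    The value is linear in the entries of P and quadratic in z.
    """
    z = list(phi_x) + [du]
    n = len(z)
    return sum(z[i] * P[i][j] * z[j] for i in range(n) for j in range(n))


def min_q_value(P: Matrix, phi_x: Sequence[float]) -> float:
    """Minimum of the quadratic Q over the control deviation.

    Assumes P is symmetric and its control block P_uu is positive, so the
    minimum is the Schur complement evaluated at phi(x).
    """
    n = len(phi_x)
    p_uu = P[n][n]
    if p_uu <= 0.0:
        raise ValueError("control block of P must be positive")
    cross = sum(P[n][j] * phi_x[j] for j in range(n))
    state_part = sum(
        phi_x[i] * P[i][j] * phi_x[j] for i in range(n) for j in range(n)
    )
    return state_part - cross * cross / p_uu


def l1_bellman_cost(
    P: Matrix,
    data: Sequence[Transition],
    phi: Callable[[Sequence[float]], Sequence[float]],
    base_policy: Callable[[Sequence[float]], float],
    gamma: float,
) -> float:
    """Sum of absolute Bellman residuals over the data set."""
    total = 0.0
    for x, u, cost, x_next in data:
        q_now = q_value(P, phi(x), u - base_policy(x))
        target = cost + gamma * min_q_value(P, phi(x_next))
        total += abs(q_now - target)
    return total
```

For a scalar linear system with quadratic cost, the exact optimal Q-function has this quadratic form, so the cost above vanishes (up to rounding) at the corresponding `P` and is positive for other choices. The methods in the paper minimize this kind of cost over the parameters; the sketch only shows what is being minimized.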

Paper Structure (12 sections, 25 equations, 1 figure, 2 algorithms)

Figures (1)

  • Figure 1: Mean cumulative reward and 95% confidence interval for a 100-sample simulation, plotted against the number of data points used. We compare LQR with feedback linearization, LMI-QL, LMI-QLi, and LSPI.

Theorems & Definitions (1)

  • Remark 1