Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning

Aritra Mitra, George J. Pappas, Hamed Hassani

TL;DR

The paper asks whether TD-based reinforcement learning is robust to compressed, biased update directions, and introduces EF-TD, a TD(0) variant equipped with an error-feedback memory. It develops a general framework for analyzing compressed stochastic approximation (SA) under Markovian sampling and establishes non-asymptotic convergence guarantees that extend to nonlinear SA schemes subsuming Q-learning. It then shows that multi-agent TD learning can achieve a linear speedup in the number of agents while each agent communicates only $\tilde{O}(1)$ bits per iteration, relying on a careful Lyapunov-based analysis and variance-reduction techniques. The results give a principled foundation for communication-efficient, scalable RL in multi-agent RL (MARL) and federated RL (FRL) settings, and shed light on how slowly mixing Markov chains interact with compression distortion and error-feedback.

Abstract

In large-scale distributed machine learning, recent works have studied the effects of compressing gradients in stochastic optimization to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? We investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our work makes three important technical contributions. First, we prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. Second, we show that our analysis framework extends seamlessly to nonlinear stochastic approximation schemes that subsume Q-learning. Third, we prove that for multi-agent TD learning, one can achieve linear convergence speedups with respect to the number of agents while communicating just $\tilde{O}(1)$ bits per iteration. Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our proofs hinge on the construction of novel Lyapunov functions that capture the dynamics of a memory variable introduced by error-feedback.
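The error-feedback mechanism mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard error-feedback recipe from optimization (add the stored compression error to the new TD direction, compress the sum, keep whatever was lost as the new memory); the paper's Algorithm 1 may differ in its details. The function names, the toy 3-state chain, the features, and the step-size are all hypothetical choices made for illustration.

```python
import random

def scaled_sign(v):
    """Scaled sign compressor: every entry becomes +/- (||v||_1 / d)."""
    scale = sum(abs(x) for x in v) / len(v)
    return [scale * ((x > 0) - (x < 0)) for x in v]

def td_direction(theta, phi, phi_next, reward, gamma):
    """TD(0) direction for linear values V(s) = phi(s)^T theta."""
    value = sum(p * t for p, t in zip(phi, theta))
    value_next = sum(p * t for p, t in zip(phi_next, theta))
    td_error = reward + gamma * value_next - value
    return [td_error * p for p in phi]

def ef_td(features, P, rewards, gamma, alpha, compress, steps, seed=0):
    """Compressed TD(0) with an error-feedback memory along one trajectory."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    d = len(features[0])
    theta = [0.0] * d
    memory = [0.0] * d  # accumulated compression error
    s = 0
    for _ in range(steps):
        # Markovian sampling: the next state depends on the current one.
        s_next = rng.choices(states, weights=P[s])[0]
        g = td_direction(theta, features[s], features[s_next], rewards[s], gamma)
        corrected = [m + gi for m, gi in zip(memory, g)]   # add back past error
        h = compress(corrected)                            # compressed update
        memory = [c - hi for c, hi in zip(corrected, h)]   # store what was lost
        theta = [t + alpha * hi for t, hi in zip(theta, h)]
        s = s_next
    return theta, memory

# Toy 3-state chain with 2-dimensional features (all values hypothetical).
FEATURES = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
REWARDS = [1.0, 0.0, -1.0]

theta, memory = ef_td(FEATURES, P, REWARDS, gamma=0.5, alpha=0.05,
                      compress=scaled_sign, steps=2000)
print("theta:", theta)
print("memory:", memory)
```

Passing an identity compressor (`lambda v: list(v)`) recovers plain TD(0) with a memory that stays at zero, which is a convenient sanity check when experimenting with other compressors.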

Paper Structure

This paper contains 19 sections, 33 theorems, 222 equations, 4 figures, 2 algorithms.

Key Result

Theorem 1

Suppose the assumption on the underlying Markov chain (label ass:Markov in the paper) holds. There exist universal constants $c, C \geq 1$ such that the iterates generated by EF-TD with step-size $\alpha \leq \frac{\omega (1-\gamma)}{c \max\{\delta, \tau\}}$ satisfy the following $\forall T\geq \tau$:

Figures (4)

  • Figure 1: A geometric interpretation of the operator $\mathcal{Q}_{\delta}(\cdot)$ in Algorithm 1. A larger $\delta$ induces more distortion.
  • Figure 2: Plots of the mean-squared error $E_t=\Vert \theta_t-\theta^* \Vert^2_2$ for vanilla TD(0) without compression, and SignTD(0) with (EF-SignTD) and without (SignTD) error-feedback. (Left) Discount factor $\gamma=0.5$. (Right) Discount factor $\gamma=0.9$.
  • Figure 3: Plot of the mean-squared error $E_t=\Vert \theta_t-\theta^* \Vert^2_2$ for EF-TD (Algorithm 1), with $\mathcal{Q}_{\delta}(\cdot)$ chosen to be the top-$k$ operator. We study the effect of varying the number of transmitted components $k$.
  • Figure 4: Plots of the mean-squared error $E_t=\Vert \theta_t-\theta^* \Vert^2_2$ for multi-agent EF-TD (Algorithm 2). (Left) $\mathcal{Q}_{\delta}(\cdot)$ is the sign operator. (Right) $\mathcal{Q}_{\delta}(\cdot)$ is the top-$k$ operator with $k=2$.
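
The captions above refer to a top-$k$ operator and a distortion level $\delta$. As a rough sketch, the snippet below implements top-$k$ and measures its relative distortion. It assumes the contraction-type convention $\Vert \mathcal{Q}_{\delta}(x) - x \Vert_2^2 \leq (1 - 1/\delta) \Vert x \Vert_2^2$, which is common in the compression literature; the paper's exact definition is not reproduced here. Under that convention top-$k$ in dimension $d$ corresponds to $\delta = d/k$. The helper names and the random test vector are hypothetical.

```python
import random

def top_k(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out

def distortion(v, q):
    """Relative squared distortion ||q - v||^2 / ||v||^2."""
    num = sum((a - b) ** 2 for a, b in zip(q, v))
    den = sum(b * b for b in v)
    return num / den

rng = random.Random(0)
d = 10
v = [rng.gauss(0.0, 1.0) for _ in range(d)]
for k in (1, 2, 5, 10):
    ratio = distortion(v, top_k(v, k))
    # The dropped d - k entries are the smallest in magnitude, so the
    # ratio never exceeds 1 - k/d (delta = d/k under the assumed convention).
    print(k, round(ratio, 3), "bound:", 1 - k / d)
```

Smaller $k$ means fewer transmitted components and a larger $\delta$, matching the remark in Figure 1 that a larger $\delta$ induces more distortion.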

Theorems & Definitions (60)

  • Remark 1
  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 2
  • Remark 3
  • Remark 4
  • Lemma 1
  • Lemma 2
  • ...and 50 more