Nonparametric Bellman Mappings for Reinforcement Learning: Application to Robust Adaptive Filtering

Yuki Akiyama, Minh Vu, Konstantinos Slavakis

TL;DR

The paper tackles robust online reinforcement learning for adaptive filtering in the presence of unknown outlier statistics by introducing nonparametric Bellman mappings defined in reproducing kernel Hilbert spaces (RKHSs). A variational framework is developed to tune the free parameters of these mappings, revealing that several established Bellman-mapping designs (e.g., LSPE, BR) are special cases of the proposed family. The method supports on-the-fly trajectory sampling, experience replay, and dimensionality reduction via random Fourier features, enabling efficient online or time-adaptive learning with a discrete action grid that selects the per-step loss power parameter $p$. The approach is applied to robust adaptive filtering by online selection of $p$ in the least-mean-$p$-power method, showing superior performance on synthetic data against various RL and non-RL schemes while remaining distribution-free and training-data-free. Overall, the work provides a principled, scalable RKHS-based RL framework with strong contraction/consistency guarantees and broad potential for online robust decision-making under uncertain data statistics.

Abstract

This paper designs novel nonparametric Bellman mappings in reproducing kernel Hilbert spaces (RKHSs) for reinforcement learning (RL). The proposed mappings benefit from the rich approximating properties of RKHSs, adopt no assumptions on the statistics of the data owing to their nonparametric nature, require no knowledge of the transition probabilities of Markov decision processes, and may operate without any training data. Moreover, they allow for on-the-fly sampling via the design of trajectory samples, re-use past test data via experience replay, effect dimensionality reduction by random Fourier features, and enable computationally lightweight operations that fit into efficient online or time-adaptive learning. The paper also offers a variational framework to design the free parameters of the proposed Bellman mappings, and shows that appropriate choices of those parameters yield several popular Bellman-mapping designs. As an application, the proposed mappings are employed to offer a novel solution to the problem of countering outliers in adaptive filtering. More specifically, with no prior information on the statistics of the outliers and no training data, a policy-iteration algorithm is introduced to select online, per time instance, the "optimal" coefficient $p$ in the least-mean-$p$-power-error method. Numerical tests on synthetic data showcase, in most cases, the superior performance of the proposed solution over several RL and non-RL schemes.
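To make the dimensionality-reduction step mentioned in the abstract more concrete, the following is a minimal sketch of random Fourier features approximating a kernel evaluation by a finite-dimensional inner product. It assumes a Gaussian kernel with bandwidth `sigma`; the abstract does not state which kernel the paper uses, and the names `make_rff`, `gaussian_kernel`, and `num_features` are hypothetical, introduced only for illustration.

```python
import numpy as np

def make_rff(dim, num_features, sigma, seed=0):
    """Return a feature map z(.) with z(x) @ z(y) ~ Gaussian kernel k(x, y).

    Assumption: Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)); the kernel
    used in the paper is not specified in the abstract.
    """
    rng = np.random.default_rng(seed)
    # Frequencies drawn from the kernel's spectral density, N(0, I / sigma^2).
    W = rng.standard_normal((num_features, dim)) / sigma
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def features(x):
        return np.sqrt(2.0 / num_features) * np.cos(W @ x + b)

    return features

def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel, used only to check the approximation."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# Illustrative inputs (not taken from the paper).
phi = make_rff(dim=3, num_features=5000, sigma=1.0)
x = np.array([0.2, -0.1, 0.4])
y = np.array([0.0, 0.3, 0.1])
approx = phi(x) @ phi(y)            # inner product of random features
exact = gaussian_kernel(x, y, 1.0)  # exact kernel value
print(f"approx={approx:.4f}, exact={exact:.4f}")
```

With such a map, a function in the RKHS is represented by a fixed-length weight vector rather than by a kernel expansion that grows with the number of samples, which is what keeps per-step cost bounded in an online setting.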

Paper Structure (25 sections, 16 theorems, 124 equations, 7 figures, 1 algorithm)

This paper contains 25 sections, 16 theorems, 124 equations, 7 figures, 1 algorithm.

Key Result

Proposition 1

(Variational framework for B-Maps) Consider the user-defined loss function $\mathcal{L}\colon \mathbb{R}^N \times \mathbb{R}^{ N\times N_{\textnormal{av}} } \to \mathbb{R} \colon (\boldsymbol{\gamma}, \boldsymbol{\Upsilon}) \mapsto \mathcal{L}(\boldsymbol{\gamma}, \boldsymbol{\Upsilon})$ and the reg where $\sigma \in \mathbb{R}_{+}$.

Figures (7)

  • Figure 1: RL as a sequential-decision-making framework: Identify the agent's policy $\mu(\cdot)$ (a decision- or action-making function) which minimizes the total loss ($=$ one-step loss $+$ long-term loss) to be paid by the agent for its sequence of decisions/actions.
  • Figure 2: Scenario 1: Algorithm 1 against LMP. Curves shown: Algorithm 1 with $\alpha=0.9$, $N[n]\geq 1$; LMP with $p=1, 1.25, 1.5, 1.75, 2$; and an algorithm which randomly chooses $p$, $\forall n$.
  • Figure 3: Scenario 1: Algorithm 1 against non-RL- and RL-based methods. Curves shown: Algorithm 1 with $\alpha=0.9$, $N[n] \geq 1$; Algorithm 1 with $\alpha=0.9$, $N[n] = 1$; kernel-based TD(0) with $\alpha = 0.9$ [Bae:kerneltd1:11]; [vazquez2012] with $p=1$, $\gamma_1 = 0.9$, $\gamma_2 = 0.99$; KLSPI with $\alpha=0.9$ [xu2007klspi]; mixed norm [chambers1997robust]; the predecessor [minh:icassp23] of the current work; VKW-MCC [huang2017adaptive].
  • Figure 4: Scenario 1: Versions of Algorithm 1 under several parameter settings: $\alpha = 0.9$, $N[n]\geq 1$; $\alpha=0.75$, $N[n]\geq 1$; $\alpha = 0$; $\alpha=0.9$, $N[n]=1$; $\alpha=0.75$, $N[n]=1$.
  • Figure 5: Scenario 2: Algorithm 1 against LMP. Curves shown: Algorithm 1 with $\alpha=0.9$, $N[n]\geq 1$; LMP with $p=1, 1.25, 1.5, 1.75, 2$; and an algorithm which randomly chooses $p$, $\forall n$.
  • ...and 2 more figures
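
The fixed-$p$ LMP baselines named in the figure captions can be illustrated with a short sketch. It uses the textbook stochastic-gradient form of the least-mean-$p$-power update, $w \leftarrow w + \rho\, p\, |e|^{p-1} \operatorname{sign}(e)\, x$ with $e = d - w^\top x$, which is assumed here rather than taken from the paper. The step size `rho`, the synthetic data model, and the outlier rate are hypothetical choices for illustration only; the paper's contribution is to pick $p$ online via RL, which this sketch does not do.

```python
import numpy as np

def lmp_step(w, x, d, p, rho=0.01):
    """One stochastic-gradient step on the loss |d - w^T x|^p."""
    e = d - w @ x
    # d/dw |e|^p = -p * |e|^(p-1) * sign(e) * x, so descend along its negative.
    return w + rho * p * np.abs(e) ** (p - 1) * np.sign(e) * x

def run_lmp(X, d, p, rho=0.01):
    """Run LMP with a fixed p over all samples, starting from zero weights."""
    w = np.zeros(X.shape[1])
    for x_n, d_n in zip(X, d):
        w = lmp_step(w, x_n, d_n, p, rho)
    return w

# Hypothetical synthetic setup: linear system, small noise, sparse large outliers.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -0.5, 0.25])
X = rng.standard_normal((5000, 3))
d = X @ w_true + 0.01 * rng.standard_normal(5000)
outliers = rng.random(5000) < 0.05
d[outliers] += 50.0 * rng.standard_normal(outliers.sum())

# Same grid of p values as in the figure captions.
for p in (1.0, 1.25, 1.5, 1.75, 2.0):
    err = np.linalg.norm(run_lmp(X, d, p) - w_true)
    print(f"p={p:<4}  ||w - w_true|| = {err:.3f}")
```

For $p=2$ the step reduces to LMS (up to the factor 2 in the step size), and for $p=1$ it reduces to the sign-error update, so the grid spans a range from outlier-sensitive to outlier-robust behavior.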

Theorems & Definitions (16)

  • Proposition 1
  • Theorem 2
  • Theorem 4
  • Theorem 6
  • Theorem 8
  • Theorem 10
  • Lemma 11
  • Theorem 12
  • Lemma 13
  • Lemma 14
  • ...and 6 more