Automatic feature identification in least-squares policy iteration using the Koopman operator framework

Christian Mugisho Zagabe, Sebastian Peitz

Abstract

In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm for reinforcement learning (RL). The KAE-LSPI algorithm is based on reformulating the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic way to choose features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two earlier methods, the classical least-squares policy iteration (LSPI) and the kernel-based least-squares policy iteration (KLSPI), using stochastic chain walk and inverted pendulum control problems as examples. Unlike these earlier methods, our approach requires no features or kernels to be fixed a priori. Empirical results show that the number of features learned by the KAE technique remains modest compared to the number fixed in the classical LSPI algorithm, and that convergence to an optimal or near-optimal policy is comparable to the other two methods.
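For readers unfamiliar with the least-squares fixed-point approximation that KAE-LSPI reformulates, the following is a minimal NumPy sketch of the classical LSPI building block (LSTD-Q evaluation alternated with greedy improvement), assuming a hand-fixed feature map `phi(s, a)`. It is exactly this fixed map that the KAE framework replaces with learned features; the function names `lstdq` and `lspi` are ours for illustration, not from the paper.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95):
    """Least-squares fixed-point (LSTD-Q) policy evaluation.

    samples: list of (s, a, r, s_next) transitions
    phi:     feature map phi(s, a) -> np.ndarray of shape (k,)
    policy:  current policy, policy(s) -> action
    Returns weights w with Q_pi(s, a) ~ phi(s, a) @ w.
    """
    k = len(phi(*samples[0][:2]))
    A, b = np.zeros((k, k)), np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
        b += r * f
    # lstsq instead of solve: A can be singular for poor feature choices.
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lspi(samples, phi, actions, gamma=0.95, n_iter=20, tol=1e-6):
    """Classical LSPI: alternate LSTD-Q evaluation with greedy improvement."""
    w = np.zeros(len(phi(*samples[0][:2])))
    for _ in range(n_iter):
        greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, phi, greedy, gamma)
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w
```

As we read the abstract, KAE-LSPI hands the role of `phi` over to the encoder of a Koopman autoencoder trained on transition data, so the feature dictionary no longer has to be specified by hand.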

Paper Structure

This paper contains 13 sections, 32 equations, 4 figures, 1 table, and 2 algorithms.

Figures (4)

  • Figure 1: Architecture of the Koopman autoencoder (KAE). In our case, the activation function $\tanh$ is used only between the last two layers of the encoder; all other layers use the rectified linear unit (ReLU), $f(x) = \max(0, x)$. (A sketch of such an architecture appears after this list.)
  • Figure 2: Policy evolution across iterations for the three compared methods (panels a–c). Policies are represented in the radial direction. The white gap between states $1$ and $20$ indicates the absence of transitions between them. (d) Dictionary size per iteration.
  • Figure 3: Policies by iteration for the three compared methods (panels a–c). (d) Dictionary size per iteration.
  • Figure 4: Performance of the compared methods (panels a and b). The red and black lines represent the worst and the best performances, respectively.
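The Figure 1 caption specifies only the activations of the autoencoder; the sketch below is one plausible PyTorch reading of that description, with $\tanh$ between the last two encoder layers and ReLU elsewhere. The layer widths, latent dimension, the linear Koopman layer `K`, and the loss terms are illustrative assumptions in the spirit of standard Koopman autoencoders, not details taken from the paper.

```python
import torch
import torch.nn as nn

class KoopmanAutoencoder(nn.Module):
    """Sketch matching the Figure 1 caption: tanh only between the last
    two encoder layers, ReLU elsewhere. Widths and latent size are
    illustrative, not from the paper."""

    def __init__(self, state_dim=4, hidden=64, latent=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.Tanh(),  # tanh between last two encoder layers
            nn.Linear(hidden, latent),
        )
        # Linear Koopman operator acting on the latent features.
        self.K = nn.Linear(latent, latent, bias=False)
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # learned features / observables
        z_next = self.K(z)        # linear one-step prediction in feature space
        return self.decoder(z), self.decoder(z_next), z, z_next

def kae_loss(model, x, x_next):
    """Reconstruction + linear-prediction objective (a standard KAE sketch)."""
    x_rec, x_pred, z, z_next = model(x)
    recon = nn.functional.mse_loss(x_rec, x)
    pred = nn.functional.mse_loss(x_pred, x_next)
    latent = nn.functional.mse_loss(z_next, model.encoder(x_next))
    return recon + pred + latent
```

The linear layer `K` is what makes the encoder output usable as an EDMD-style dictionary: the dynamics are forced to be approximately linear in the encoded coordinates, which is the property the least-squares fixed-point reformulation exploits.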