Uncertainty Representations in State-Space Layers for Deep Reinforcement Learning under Partial Observability

Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters

TL;DR

This work addresses partial observability in reinforcement learning by introducing a standalone Kalman filter layer that performs explicit Gaussian inference in linear state-space models. The KF layer is designed as a drop-in recurrent-like module that can be trained end-to-end within a model-free off-policy actor-critic architecture and leverages parallel scans for scalable sequence processing. Across diverse POMDP benchmarks, KF layers consistently improve uncertainty reasoning, adaptation, and robustness to observation noise, often matching or surpassing deterministic SSMs and transformer-based models on key tasks. The results demonstrate the value of explicit probabilistic filtering in latent space for control under partial observability, while also outlining limitations and directions for future work such as exploring more complex process noise dynamics and larger architectures.

Abstract

Optimal decision-making under partial observability requires reasoning about the uncertainty of the environment's hidden state. However, most reinforcement learning architectures handle partial observability with sequence models that have no internal mechanism to incorporate uncertainty in their hidden state representation, such as recurrent neural networks, deterministic state-space models and transformers. Inspired by advances in probabilistic world models for reinforcement learning, we propose a standalone Kalman filter layer that performs closed-form Gaussian inference in linear state-space models and train it end-to-end within a model-free architecture to maximize returns. Similar to efficient linear recurrent layers, the Kalman filter layer processes sequential data using a parallel scan, which scales logarithmically with the sequence length. By design, Kalman filter layers are a drop-in replacement for other recurrent layers in standard model-free architectures, but importantly they include an explicit mechanism for probabilistic filtering of the latent state representation. Experiments in a wide variety of tasks with partial observability show that Kalman filter layers excel in problems where uncertainty reasoning is key for decision-making, outperforming other stateful models.

Paper Structure (77 sections, 9 equations, 19 figures, 2 tables)

Figures (19)

  • Figure 1: General Recurrent Actor-Critic (RAC) architecture. The components are trained end-to-end with the Soft Actor-Critic (SAC) loss function (Haarnoja et al., 2018). To handle discrete action spaces, we use the discrete version of SAC by Christodoulou (2019).
  • Figure 2: Our proposed Kalman filter layer to build history encoders. The KF layer receives a history sequence $h_{:t}$ and projects it into three separate signals in latent space: the input $u_{:t}$, the observation $w_{:t}$ and the observation noise (diagonal) covariance $\mathbf{\Sigma}^{\textnormal{o}}_{:t}$. These sequences are processed using the standard Kalman filtering equations, which scale logarithmically with the sequence length using parallel scans. Lastly, the posterior mean latent state $x^{+}_{:t}$ is projected from the latent space back into the history space to obtain the compressed representation $z_{:t}$.
  • Figure 3: Two example episodes of the Best Arm Identification task, with $\mu_b = 0.5$ and two different noise scales. (Left) Narrow noise distribution with $\sigma_b = 0.5$. (Right) Wide noise distribution with $\sigma_b = 1.0$. In red, we visualize the Bayesian posterior mean and $3\sigma$ confidence interval around $\mu_b$, obtained via Bayesian linear regression using all prior observations in the episode.
  • Figure 4: Performance of sequence models in the Best Arm Identification problem after $500\textnormal{K}$ environment steps. We conduct experiments for increasing cost of requesting new observations and evaluate performance both in and out of distribution, averaged over 100 episodes, and report the mean and standard error over 5 random seeds. (Top row) Normalized return, obtained by dividing returns by the reward given after winning (10 in our case). (Bottom row) Length of episodes.
  • Figure 5: Performance heatmap on Best Arm Identification problem ($\rho=0$). We generate a grid of noise parameters $(\mu_b, \sigma_b)$ for a total of 625 unique combinations. The red vertical line separates training (to the left) from out-of-distribution (to the right) latent parameters. For each pair of latent parameters, we evaluate performance on five independently trained agents over 100 episodes and report the average win rate and episode lengths.
  • ...and 14 more figures
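
The filtering step described in the Figure 2 caption can be illustrated with a minimal sequential sketch for a single latent dimension. It is not the paper's implementation. The scalar dynamics `a`, input gain `b`, process noise `q`, the identity observation model, and the function name `kalman_filter_1d` are all assumptions made for illustration. Only the roles of the input $u$, the observation $w$, the observation noise variance, and the posterior mean $x^{+}$ come from the caption.

```python
def kalman_filter_1d(u, w, obs_var, a=0.9, b=1.0, q=0.1, mean0=0.0, var0=1.0):
    """Scalar Kalman filter over a sequence (illustrative sketch).

    Assumed model (not taken from the paper):
        x_t = a * x_{t-1} + b * u_t + process noise with variance q
        w_t = x_t + observation noise with variance obs_var[t]
    Returns the posterior means and posterior variances at every step.
    """
    mean, var = mean0, var0
    post_means, post_vars = [], []
    for u_t, w_t, r_t in zip(u, w, obs_var):
        # Predict: propagate the previous posterior through the linear dynamics.
        prior_mean = a * mean + b * u_t
        prior_var = a * a * var + q
        # Update: the Kalman gain weighs the observation by its noise variance.
        gain = prior_var / (prior_var + r_t)
        mean = prior_mean + gain * (w_t - prior_mean)
        var = (1.0 - gain) * prior_var
        post_means.append(mean)
        post_vars.append(var)
    return post_means, post_vars


# One step with a unit-variance prior and a unit-variance observation:
# the posterior mean lands halfway between the prior (0.0) and the observation (2.0).
means, variances = kalman_filter_1d(
    u=[0.0], w=[2.0], obs_var=[1.0], a=1.0, b=0.0, q=0.0
)
print(means, variances)  # [1.0] [0.5]
```

A large observation variance shrinks the gain, so the filter effectively ignores that observation. This is the mechanism that allows a per-step noise covariance to modulate how much each input is trusted.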

Theorems & Definitions (1)

  • Definition 1: Masked Associative Operator
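
The definition itself is not reproduced in this listing, so the sketch below shows only the general idea behind an ordinary, unmasked associative operator. It illustrates how a linear recurrence $x_t = a_t x_{t-1} + b_t$ can be evaluated with a scan that needs a logarithmic number of combination rounds. The function names and the scalar recurrence are assumptions for illustration. The masking in the paper's operator is not modeled here.

```python
def combine(left, right):
    """Compose two affine maps x -> a*x + b, applying `left` first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)


def inclusive_scan(elems, op):
    """Hillis-Steele inclusive scan: about log2(T) rounds of pairwise combines.

    Each round's combines are independent of one another, so they could run
    in parallel; here they are executed in a plain list comprehension.
    """
    out = list(elems)
    offset = 1
    while offset < len(out):
        out = [
            out[i] if i < offset else op(out[i - offset], out[i])
            for i in range(len(out))
        ]
        offset *= 2
    return out


def linear_recurrence(a, b, x0):
    """Evaluate x_t = a[t] * x_{t-1} + b[t] for all t via the scan."""
    prefix = inclusive_scan(list(zip(a, b)), combine)
    # Each prefix element (A, B) maps x0 directly to x_t = A * x0 + B.
    return [A * x0 + B for A, B in prefix]


print(linear_recurrence([2, 3, 1, 2, 1], [1, 0, 2, 1, 3], x0=1))  # [3, 9, 11, 23, 26]
```

The scan is valid because composing affine maps is associative, even though it is not commutative. The order of the arguments to `combine` must therefore be preserved.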