R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models

Viet Dung Nguyen, Zhizhuo Yang, Christopher L. Buckley, Alexander Ororbia

TL;DR

The paper tackles sparse-reward, continuous-action robotic control in pixel-based POMDPs by extending active inference with R-AIF, which combines a recurrent world model (RSSM) with a dynamically learned prior (CRSPP) and an actor-critic planner. Actions are chosen to minimize the expected free energy $G_\tau(\pi)$ in imagined futures, incorporating instrumental rewards, epistemic curiosity via an information gain ensemble, and a self-revision mechanism that adaptively shapes goals. The key contributions are the CRSPP prior, the robust self-revision signaling, the dynamic EFE formulation, and the network-ensemble approach for information gain, all integrated into an off-policy training pipeline. Empirically, R-AIF converges faster and reaches higher final performance with greater stability than DreamerV3 and prior AIF baselines across Mountain Car, Meta-World, and robosuite pixel tasks, demonstrating improved robustness and data efficiency for high-dimensional, partially observable robotics. These results indicate that adaptive priors and principled information-seeking behavior can substantially enhance active inference in realistic control problems.

Abstract

Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate. The code in support of this work can be found at https://github.com/NACLab/robust-active-inference.

Paper Structure

This paper contains 18 sections, 21 equations, 5 figures, 1 table, 2 algorithms.

Figures (5)

  • Figure 1: Demonstration of different abstract trajectories with state (y-axis) through time (x-axis). The behavior cloning trajectory diverges from the training trajectory due to small mistakes made by the agent as well as environmental stochasticity (due to the i.i.d. assumption applied to the environment). We instead want to estimate a "preferred" trajectory that closely matches the underlying data distribution. As a result, our R-AIF agent can "nudge" its trajectory toward its own prior preference. The dashed lines around the R-AIF line represent the epistemic signal of the agent, which facilitates intelligent exploration within a safe range.
  • Figure 2: The CRSPP learning framework. CRSPP learns by optimizing the KL divergence between its approximate posterior and prior only when a state is "desired", i.e. $\rho_t > 0$. It also learns to predict next preferred states using a dynamic contrastive loss based on $\rho_t$ (which focuses on narrowing the gap between the estimated preferred state distribution and the actual approximate posterior produced by the RSSM).
  • Figure 3: Actual observation (top row) versus the prior preference estimation (bottom row) across time (horizontal axis) of the mountain car problem (top image group) and the Meta-World 'button press wall' task (bottom image group). We see that CRSPP produces a goal dynamically at each time step.
  • Figure 4: Cumulative reward ($y$-axis) trend over environment time steps ($x$-axis) for different agents. The pink dashed lines show the average reward of the expert in the MDP version of the task.
  • Figure 6: The temporal generative dynamics model. A depiction of the generative model that R-AIF uses to exploit past information; this is equivalent to an RSSM operating in latent space.
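
The gating described in the Figure 2 caption, where the KL term contributes only at time steps with $\rho_t > 0$, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gaussian_kl` and `gated_kl_loss`, the diagonal-Gaussian parameterization, the direction of the KL, and the averaging over gated steps are all assumptions made for illustration, and the contrastive part of the CRSPP objective is not shown.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians given as lists of means and std devs.

    Assumption: both distributions are diagonal Gaussians; the paper's
    latent distributions may be parameterized differently.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def gated_kl_loss(posteriors, priors, rho):
    """Average the KL over only those time steps whose signal rho_t > 0.

    posteriors, priors: per-step (mu, sigma) pairs (hypothetical layout).
    rho: per-step preference signal; steps with rho_t <= 0 are skipped.
    Averaging (rather than summing) is an assumption of this sketch.
    """
    terms = [
        gaussian_kl(*q, *p)
        for q, p, r in zip(posteriors, priors, rho)
        if r > 0
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Three time steps with one latent dimension; only the first is "desired".
posteriors = [([0.0], [1.0]), ([5.0], [1.0]), ([9.0], [1.0])]
priors = [([1.0], [1.0]), ([0.0], [1.0]), ([0.0], [1.0])]
rho = [1.0, 0.0, -1.0]
loss = gated_kl_loss(posteriors, priors, rho)
```

In this toy example only the first step passes the gate, so the large mismatches at the second and third steps do not affect the loss.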