Real-World Robot Control by Deep Active Inference With a Temporally Hierarchical World Model

Kentaro Fujii, Shingo Murata

TL;DR

Real-world robot control under uncertainty requires balancing goal-directed actions with exploration. The authors propose a deep active inference framework featuring a temporally hierarchical world model, a vector-quantized action model, and an abstract world model to enable tractable planning and exploration. Real-world experiments demonstrate high success rates across object-manipulation tasks and the ability to switch between goal-directed and exploratory behaviors under uncertainty, while substantially reducing action-selection cost compared with conventional approaches. This work highlights the value of multi-timescale dynamics and action/state abstraction for robust, real-world robotic systems.

Abstract

Robots in uncertain real-world environments must perform both goal-directed and exploratory actions. However, most deep learning-based control methods neglect exploration and struggle under uncertainty. To address this, we adopt deep active inference, a framework that accounts for human goal-directed and exploratory actions. Yet, conventional deep active inference approaches face challenges due to limited environmental representation capacity and high computational cost in action selection. We propose a novel deep active inference framework that consists of a world model, an action model, and an abstract world model. The world model encodes environmental dynamics into hidden state representations at slow and fast timescales. The action model compresses action sequences into abstract actions using vector quantization, and the abstract world model predicts future slow states conditioned on the abstract action, enabling low-cost action selection. We evaluate the framework on object-manipulation tasks with a real-world robot. Results show that it achieves high success rates across diverse manipulation tasks and switches between goal-directed and exploratory actions in uncertain settings, while making action selection computationally tractable. These findings highlight the importance of modeling multiple timescale dynamics and abstracting actions and state transitions.
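
The vector-quantization step mentioned above can be illustrated with a minimal sketch. It assumes the action sequence has already been encoded into a continuous embedding; the encoder, the codebook values, and the name `quantize` are hypothetical and are not taken from the paper. The sketch only shows the generic nearest-code lookup that vector quantization relies on.

```python
import numpy as np

def quantize(embedding, codebook):
    """Return (index, code) of the codebook row closest to `embedding`.

    Closeness is squared Euclidean distance, the usual choice in
    vector quantization. `codebook` has shape (num_codes, dim).
    """
    dists = np.sum((codebook - embedding) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Toy codebook with four 2-D codes (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# A hypothetical embedding of an action sequence.
idx, code = quantize(np.array([0.9, 0.2]), codebook)
print(idx, code)
```

Replacing a continuous embedding with one of a small number of codes is what makes the set of candidate abstract actions finite, so it can be enumerated at action-selection time.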

Paper Structure

This paper contains 24 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The overview of the proposed framework. The framework comprises a world model, an action model, and an abstract world model. Here, key variables are visualized: observation $o_t$ and action $a_t$ are processed by the world model to infer hierarchical hidden states $z_t^\mathrm{s}$, $z_t^\mathrm{f}$. The action model compresses action sequences into abstract actions $A_t$. The abstract world model uses $A_t$ to predict the future slow deterministic state $d_{t+h}^\mathrm{s}$.
  • Figure 2: The world model. It consists of a dynamics model, an encoder, and a decoder. The dynamics model has two different timescales.
  • Figure 3: Action selection based on the minimization of EFE. First, future states are predicted for multiple abstract actions. Then, the EFE is calculated for each of the predicted future states. Finally, the robot executes the action sequence reconstructed from the abstract action that yields the lowest EFE.
  • Figure 4: Experimental environment (left) and policy patterns included in the collected dataset (right). The environment contains either a blue ball, a red ball, or both. The dataset includes demonstrations of eight different policy patterns involving the movement of the lid and the balls.
  • Figure 5: Example of predicted observations using the abstract world model and actual robot actions. (A) Predicted observations for each abstract action. Here, each $c_{i,j}$ denotes the $j$‑th code in the $i$‑th layer of the action model. The yellow box highlights an example prediction that is consistent with the initial observation, while the red box indicates an inconsistent prediction. (B) Actual observations corresponding to the action sequence generated from the abstract action $\hat{A}$ represented by $\hat{A}=c_{1,2} + c_{2,7}$ at each time step.
  • ...and 1 more figure
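
The selection loop described in the Figure 3 caption, combined with the additive code structure shown in Figure 5 ($\hat{A}=c_{1,2} + c_{2,7}$), can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `predict_fn` stands in for the abstract world model, `efe_fn` stands in for the EFE computation (whose actual form is not given in this summary), and all function names and codebook values are hypothetical. Decoding the chosen abstract action back into a low-level action sequence is omitted.

```python
import itertools
import numpy as np

def candidate_abstract_actions(codebooks):
    """Yield (indices, A) for every choice of one code per layer.

    A is the sum of the chosen codes, mirroring A = c_{1,i} + c_{2,j}.
    """
    index_ranges = [range(len(cb)) for cb in codebooks]
    for idxs in itertools.product(*index_ranges):
        yield idxs, sum(cb[i] for cb, i in zip(codebooks, idxs))

def select_abstract_action(state, codebooks, predict_fn, efe_fn):
    """Return (score, indices, A) for the candidate with the lowest score."""
    best = None
    for idxs, A in candidate_abstract_actions(codebooks):
        future = predict_fn(state, A)   # 1. predict the future state
        score = efe_fn(future)          # 2. score the prediction
        if best is None or score < best[0]:
            best = (score, idxs, A)     # 3. keep the minimum
    return best

# Toy setup: two layers with two 2-D codes each (illustrative values).
codebooks = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.5, 0.0], [0.0, 0.5]]),
]
preferred = np.array([1.0, 0.5])  # hypothetical preferred future state

# Placeholder models: additive dynamics and a distance-to-preference score.
# The score is only a stand-in and is NOT the paper's EFE.
predict_fn = lambda s, A: s + A
efe_fn = lambda f: float(np.sum((f - preferred) ** 2))

score, idxs, A = select_abstract_action(np.zeros(2), codebooks, predict_fn, efe_fn)
print(idxs, score)
```

Because the candidates are a finite product of codebook entries, the search is a plain enumeration: its cost grows with the product of the codebook sizes rather than with the length of the low-level action sequence.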