Explore the Context: Optimal Data Collection for Context-Conditional Dynamics Models

Jan Achterhold, Joerg Stueckler

TL;DR

This paper learns dynamics models for parametrized families of dynamical systems with varying properties. The models are formulated as stochastic processes conditioned on a latent context variable, which is inferred from observed transitions of the respective system.

Abstract

In this paper, we learn dynamics models for parametrized families of dynamical systems with varying properties. The dynamics models are formulated as stochastic processes conditioned on a latent context variable which is inferred from observed transitions of the respective system. The probabilistic formulation allows us to compute an action sequence which, for a limited number of environment interactions, optimally explores the given system within the parametrized family. This is achieved by steering the system through the transitions that are most informative for the context variable. We demonstrate the effectiveness of our method for exploration on a non-linear toy problem and two well-known reinforcement learning environments.
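To make the idea of steering the system through informative transitions concrete, the following is a minimal sketch under strong simplifying assumptions and is not the authors' implementation. It assumes a hypothetical scalar system $x^+ = x + \alpha\,\delta(u) + \epsilon$ with a Gaussian prior on the hidden parameter $\alpha$, and it uses a closed-form Gaussian posterior as a stand-in for the learned context encoder. The dead-zone function `effective_action`, the candidate action set, and all default values are illustrative assumptions.

```python
import itertools
import math


def effective_action(u: float) -> float:
    """Hypothetical dead-zone squashing: actions with |u| <= 1 have no effect.

    This is an illustrative stand-in, not the squashing function from the paper.
    """
    return u if abs(u) > 1.0 else 0.0


def posterior_variance(actions, prior_var=1.0, noise_std=0.1):
    """Posterior variance of alpha after observing x+ = x + alpha*delta(u) + eps.

    For a linear-Gaussian model the variance depends only on the actions,
    not on the observed outcomes, so it can be evaluated before acting.
    """
    info = sum(effective_action(u) ** 2 for u in actions) / noise_std ** 2
    return 1.0 / (1.0 / prior_var + info)


def gaussian_entropy(var: float) -> float:
    """Differential entropy of a one-dimensional Gaussian with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def plan_open_loop(candidates, horizon, **model_kwargs):
    """Exhaustively search action sequences for the lowest posterior entropy."""
    return min(
        itertools.product(candidates, repeat=horizon),
        key=lambda seq: gaussian_entropy(posterior_variance(seq, **model_kwargs)),
    )


candidates = [-2.0, -0.5, 0.0, 0.5, 2.0]
best = plan_open_loop(candidates, horizon=2)
print("chosen action sequence:", best)
print("entropy after chosen actions:", gaussian_entropy(posterior_variance(best)))
print("entropy after uninformative actions:",
      gaussian_entropy(posterior_variance([0.5, 0.5])))
```

In this stand-in, actions inside the dead zone leave the belief over $\alpha$ at its prior, so the planner selects only large-magnitude actions. With a learned encoder, the posterior would generally depend on the predicted transitions as well, and the exhaustive search would typically be replaced by a sampling-based or gradient-based optimizer.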

Paper Structure

This paper contains 28 sections, 20 equations, 8 figures.

Figures (8)

  • Figure 1: (a) Overview of our proposed calibration approach and (b) the context-conditional dynamics model.
  • Figure 2: Characteristic behavior of the latent context belief encoder on the toy problem. First row: Depiction of the action squashing functions $\delta^{<1}(u_n)$, $\delta^{>1}(u_n)$ (effective action magnitude). Action regions which are informative for inferring the hidden parameter $\alpha$ are shaded in gray ($\delta(u_n) \neq 0$). Second row: Average entropy (normalized to $[0, 1]$) of the latent context belief $H[q(\beta|C=\{\bm{x},{u},\bm{x}^+\})]$ for actions ${u}$ in systems with Gaussian observation noise $\bm{\epsilon} \sim \mathcal{N}(0, \bm{I}\cdot(0.01)^2)$ (orange) and $\bm{\epsilon}=0$ (blue). Non-informative actions yield a high entropy of the latent context belief; for informative actions, the entropy correlates negatively with the effective action magnitude. Without observation noise, the entropy attains its minimum faster for increasing effective action magnitude, as $\alpha$ can be inferred from low-magnitude actions with low variance.
  • Figure 3: Evaluation of model prediction error for the toy problem. Depicted is the prediction error (lower is better) of models with random and optimal (open-loop) calibration with $\{1,2,3\}$ calibration transitions, for both action squashing schemes $\delta^{<1}(u_n)$ (left) and $\delta^{>1}(u_n)$ (right).
  • Figure 4: Prediction error (lower is better) of the learned (a) Pendulum and (b) MountainCar models, conditioned on calibration data obtained with a random rollout (blue), Open-Loop calibration (orange), or MPC calibration (green). For the red curve, we train models without the strictly-decreasing variance constraint in the context encoder and perform MPC calibration. We plot the mean (line) and the $20\%$ / $80\%$ quantiles (shaded area; shown only for random and MPC calibration for visual clarity) over 3000 rollouts. Our proposed calibration schemes reduce prediction error compared to random calibration. MPC calibration compares favourably to Open-Loop calibration. Enforcing the decreasing variance constraint in the context encoder slightly reduces model error after calibration for Pendulum. For MountainCar, both model variants perform similarly. Calibration rollouts contain 30 transitions for the Pendulum and 50 transitions for the MountainCar environment.
  • Figure 5: Properties of the Pendulum environment
  • ...and 3 more figures