Model-based Policy Optimization using Symbolic World Model

Andrey Gorodetskiy; Konstantin Mironov; Aleksandr Panov

TL;DR

The paper addresses sample inefficiency in robotics by introducing a transformer-generated symbolic world model for model-based policy optimization. The model is a collection of per-coordinate symbolic expressions, refined with BFGS, and is used to generate short synthetic rollouts that train a SAC policy. In simulated continuous control tasks this yields better data efficiency than model-free baselines and several MBRL methods. The approach emphasizes interpretability of the dynamics, while acknowledging scalability and inference challenges for high-dimensional systems. Overall, it suggests a promising direction for combining symbolic regression with MBRL to achieve data-efficient, interpretable control in robotics, with future work aimed at scalability and integration with more dynamic modeling components.

Abstract

The application of learning-based control methods in robotics presents significant challenges. One is that model-free reinforcement learning algorithms use observation data with low sample efficiency. To address this challenge, a prevalent approach is model-based reinforcement learning, which involves employing an environment dynamics model. We suggest approximating transition dynamics with symbolic expressions, which are generated via symbolic regression. Approximation of a mechanical system with a symbolic model has fewer parameters than approximation with neural networks, which can potentially lead to higher accuracy and quality of extrapolation. We use a symbolic dynamics model to generate trajectories in model-based policy optimization to improve the sample efficiency of the learning algorithm. We evaluate our approach across various tasks within simulated environments. Our method demonstrates superior sample efficiency in these tasks compared to model-free and model-based baseline methods.

Paper Structure (7 sections, 8 figures, 1 algorithm)

Introduction
Related work
Background
Method
Interpretability
Experiments
Conclusion

Figures (8)

Figure 1: The scheme of model-based policy optimization with a symbolic model. We train a policy on samples from an environment model represented by a collection of symbolic expressions. These expressions are generated by a transformer model using observed transitions from the environment. The transformer model is pre-trained on a diverse dataset of randomly generated or environment-specific transition functions.
Figure 2: Sampling scheme of SAC training data. All transitions in the SAC buffer are generated using the world model. Any state that was observed during interaction with the environment is stored in the initial distribution buffer and can be sampled as the initial state for rollout generation.
Figure 3: Tree representation of the expression $\operatorname{clip}((\dot{\theta} + (((15.0 * \sin(\theta)) + (3.0 * \operatorname{clip}(u, 2.0))) * 0.05)), 8.0)$ for the angular velocity of the pendulum, generated by a transformer model that was trained on pendulum dynamics.
Figure 4: Illustrations of environments. The Pendulum environment (\ref{fig:frame-pendulum}) represents the task of balancing a swinging pendulum in an upright position. The agent observes the angle $\theta$ and the angular velocity, and controls the torque applied at the hinge. Reacher (\ref{fig:frame-reacher}) represents the task of reaching a target point with the tip of a two-link planar manipulator. The agent observes joint positions and velocities, the target point location, and the vector between the target and the manipulator's tip, and controls the torques applied at the joints. Car2d (\ref{fig:frame-car2d}) represents the task of parking a car, which can move forward and steer in a plane, at a target location. The agent observes position, orientation, velocity, steer, and target location, and controls the acceleration and the angular velocity of steering.
Figure 5: Method evaluation in the Pendulum environment. State space coverage shown for the collected dataset (train) and samples from joint policy-environment distributions (test) during agent training.
...and 3 more figures
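
To make the captions of Figures 2 and 3 concrete, the following minimal Python sketch evaluates the Figure 3 expression and uses it to generate short synthetic rollouts from a buffer of observed states. Several parts are assumptions that do not come from the source: the two-argument `clip(x, b)` is read as symmetric clipping to $[-b, b]$, the angle update in `angle_next` is a hypothetical companion expression (the caption gives only the angular velocity), and the names `model_step`, `generate_rollouts`, and the placeholder policy are illustrative. Rewards are omitted, so this is not a full SAC data pipeline.

```python
import math
import random


def clip(x, bound):
    """Symmetric clipping to [-bound, bound] (assumed meaning of clip(x, b))."""
    return max(-bound, min(bound, x))


def angular_velocity_next(theta, theta_dot, u):
    """Figure 3 expression for the pendulum's angular velocity, term by term."""
    return clip(
        theta_dot + ((15.0 * math.sin(theta)) + (3.0 * clip(u, 2.0))) * 0.05,
        8.0,
    )


def angle_next(theta, theta_dot_next, dt=0.05):
    """Hypothetical expression for the angle; not given in the source."""
    return theta + theta_dot_next * dt


def model_step(state, u):
    """One step of the symbolic model: one expression per state coordinate."""
    theta, theta_dot = state
    new_theta_dot = angular_velocity_next(theta, theta_dot, u)
    return (angle_next(theta, new_theta_dot), new_theta_dot)


def generate_rollouts(initial_states, policy, horizon, n_rollouts, rng):
    """Sample observed states as rollout starts and unroll the symbolic model."""
    transitions = []
    for _ in range(n_rollouts):
        state = rng.choice(initial_states)
        for _ in range(horizon):
            u = policy(state)
            next_state = model_step(state, u)
            transitions.append((state, u, next_state))
            state = next_state
    return transitions


# Usage: states observed in the real environment seed the synthetic rollouts.
rng = random.Random(0)
initial_state_buffer = [(0.1, 0.0), (math.pi, 1.0), (-0.5, -2.0)]
placeholder_policy = lambda state: -2.0 * state[1]  # stand-in for the SAC actor
synthetic = generate_rollouts(
    initial_state_buffer, placeholder_policy, horizon=5, n_rollouts=4, rng=rng
)
```

Because each state coordinate has its own closed-form expression, the model step is a handful of arithmetic operations and can be read directly, which is the interpretability argument the paper makes for symbolic world models.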
