Reinforcement Learning for Ballbot Navigation in Uneven Terrain

Achkan Salehi

TL;DR

This work addresses robust ballbot navigation on uneven terrain by augmenting RL with exteroceptive depth sensing and carefully shaped rewards. It introduces an open-source MuJoCo-based ballbot simulator and demonstrates that a classical model-free PPO agent can learn to navigate diverse, unseen terrains using roughly 4–5 hours of simulated data at 500 Hz. The key contributions are (i) an RL-friendly, open-source simulation environment for ballbots, (ii) an observation pipeline combining proprioception with depth-vision embeddings, and (iii) empirical evidence of generalization and practical performance (~0.5 m/s) in challenging terrains. The findings suggest RL-based ballbot control is viable for real-world deployment, with future work focusing on data efficiency and sim-to-real transfer.

Abstract

Ballbot (i.e., ball-balancing robot) navigation usually relies on methods rooted in control theory (CT), and works that apply reinforcement learning (RL) to the problem remain rare and are generally limited to specific subtasks (e.g., balance recovery). Unlike CT-based methods, RL does not require (simplifying) assumptions about environment dynamics (e.g., the absence of slippage between the ball and the floor). In addition to this increased accuracy in modeling, RL agents can easily be conditioned on additional observations such as depth maps without the need for explicit formulations from first principles, leading to increased adaptivity. Despite those advantages, there has been little to no investigation into the capabilities, data efficiency and limitations of RL-based methods for ballbot control and navigation. Furthermore, there is a notable absence of an open-source, RL-friendly simulator for this task. In this paper, we present an open-source ballbot simulation based on MuJoCo, and show that with appropriate conditioning on exteroceptive observations as well as reward shaping, policies learned by classical model-free RL methods are capable of effectively navigating through randomly generated uneven terrain, using a reasonable amount of data (four to five hours on a system operating at 500 Hz).
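
For a rough sense of scale, the data budget quoted above can be converted into a number of transitions. This is only a back-of-the-envelope sketch: it assumes one environment transition per 500 Hz tick, which the abstract does not state explicitly, and the symbol $N$ is introduced here for illustration.

$$
N_{4\,\mathrm{h}} = 4 \times 3600\,\mathrm{s} \times 500\,\mathrm{Hz} = 7.2 \times 10^{6},
\qquad
N_{5\,\mathrm{h}} = 5 \times 3600\,\mathrm{s} \times 500\,\mathrm{Hz} = 9.0 \times 10^{6}.
$$

Under that assumption, "four to five hours" corresponds to roughly 7 to 9 million transitions.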

Paper Structure

This paper contains 12 sections, 4 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Screenshots from our open-source simulation, where the learned policy navigates through randomly generated uneven terrain. Note that for simplicity, the three omniwheels controlling the base sphere are modeled as capsules with anisotropic tangential friction. Two low-resolution depth cameras (visible as yellow cones and denoted $C_0, C_1$ in the bottom right image) enable terrain perception. They are both oriented towards the contact point between the ball and the ground.
  • Figure 2: A CAD model of a ballbot that uses three omniwheels as its ball drive mechanism.
  • Figure 3: Using only proprioceptive observations in uneven terrain leads to ambiguities: assuming that the robots in (a), (b), (c) have the same state at time $t$ (in the figure, only the velocity $v_t$ is explicitly shown for clarity) and adopting a model-based view, it is clear that any learned model of the form $s_{t+1}=M(a_t,s_t)$ will be ambiguous, as the state information is not sufficient for predicting whether the robot will encounter flat terrain or an increasing/decreasing slope at $t+1$. This ambiguity cannot be fully resolved by recurrent models or distributional methods based only on proprioceptive data, but can be alleviated by incorporating observations from exteroceptive sensors, such as depth cameras.
  • Figure 4: The policy used to navigate uneven terrain. Two low-resolution $128\times128$ depth images, one from each depth camera, are fed to a pretrained encoder that maps each of them to an embedding in $\mathbb{R}^{20}$. These embeddings are then concatenated to the proprioceptive observations: orientation, angular velocity, body velocity, angular velocities of each omniwheel (the latter are in local coordinates), as well as the command vector sent at the previous timestep. To avoid state ambiguities arising from the differences in observation frequencies ($500$ Hz for proprioceptive readings vs $\sim 80$ Hz for the depth cameras), we also concatenate the time $\Delta t$ that has elapsed since the last depth image observation was received. This $56$-dimensional vector is then fed into a small MLP which is trained to predict torque commands that are sent to the omniwheel motors.
  • Figure 5: (Left) The ballbot from our open-source simulation. (Right) The omniwheels are modeled as capsules with anisotropic tangential friction. Friction is high along the $T_1$ axis, which is the direction along which the wheel applies torque to the sphere. Friction in the $T_2$ direction, which corresponds to the rotation direction of the omniwheel's idler rollers, should be negligible.
  • ...and 3 more figures
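
The observation layout described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation. The image size, the embedding size and the total of 56 dimensions come from the caption. The split of the remaining 15 proprioceptive dimensions into five 3-vectors is an assumption, as are the random linear map standing in for the pretrained encoder, the MLP layer sizes, the three torque outputs (one per omniwheel), and all function names.

```python
import numpy as np

IMG_SHAPE = (128, 128)  # from the caption: low-resolution depth images
EMBED_DIM = 20          # from the caption: each image maps to R^20
OBS_DIM = 56            # from the caption: size of the policy input

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pretrained encoder: a fixed random linear map.
_W_ENC = rng.standard_normal((EMBED_DIM, IMG_SHAPE[0] * IMG_SHAPE[1])) * 1e-2


def encode_depth(img):
    """Map one depth image to a 20-dimensional embedding (placeholder encoder)."""
    assert img.shape == IMG_SHAPE
    return _W_ENC @ img.reshape(-1)


def build_observation(depth0, depth1, orientation, ang_vel, body_vel,
                      wheel_vel, last_cmd, dt_since_depth):
    """Concatenate depth embeddings, proprioception and the elapsed time dt."""
    # Assumption: each proprioceptive quantity is a 3-vector (5 * 3 = 15).
    proprio = np.concatenate([orientation, ang_vel, body_vel, wheel_vel, last_cmd])
    assert proprio.shape == (15,)
    obs = np.concatenate([encode_depth(depth0), encode_depth(depth1),
                          proprio, [dt_since_depth]])
    assert obs.shape == (OBS_DIM,)  # 20 + 20 + 15 + 1
    return obs


def init_mlp(sizes, rng):
    """Random weights for a small MLP; layer sizes are illustrative only."""
    return [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def mlp_policy(obs, params):
    """Forward pass: tanh hidden layers, linear output read as motor torques."""
    h = obs
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)
    W, b = params[-1]
    return W @ h + b


if __name__ == "__main__":
    params = init_mlp([OBS_DIM, 64, 64, 3], rng)
    depth = np.ones(IMG_SHAPE)
    zeros3 = np.zeros(3)
    obs = build_observation(depth, depth, zeros3, zeros3, zeros3,
                            zeros3, zeros3, dt_since_depth=0.004)
    print(obs.shape, mlp_policy(obs, params).shape)
```

Because the depth cameras run at roughly 80 Hz while proprioception arrives at 500 Hz, a control loop built this way would presumably reuse the most recent embeddings between camera frames and only update `dt_since_depth`, which is what lets the policy tell a fresh depth reading from a stale one.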