CRoSS: A Continual Robotic Simulation Suite for Scalable Reinforcement Learning with High Task Diversity and Realistic Physics Simulation

Yannick Denker, Alexander Gepperth

TL;DR

CRoSS introduces a Gazebo-based continual reinforcement learning (CRL) benchmark for robotics that addresses gaps in realism, task diversity, and out-of-the-box usability in prior CRL suites. It combines two platforms (a two-wheeled differential-drive robot and a 7-DOF robot arm) with six benchmarks that cover Cartesian and joint-space control, plus fast kinematics-only variants to accelerate experimentation. Baseline RL methods such as DQN and REINFORCE exhibit clear catastrophic forgetting across sequential tasks, while the kinematic variants enable rapid hyperparameter search without sacrificing task structure or transferability to the simulated physics. The benchmark emphasizes reproducibility and sim-to-real potential via containerized deployment and ROS-Gazebo interoperability, making CRoSS a scalable, extensible foundation for advancing CRL in robotics and prospective real-world deployments.

Abstract

Continual reinforcement learning (CRL) requires agents to learn from a sequence of tasks without forgetting previously acquired policies. In this work, we introduce a novel benchmark suite for CRL based on realistically simulated robots in the Gazebo simulator. Our Continual Robotic Simulation Suite (CRoSS) benchmarks rely on two robotic platforms: a two-wheeled differential-drive robot with lidar, camera and bumper sensor, and a robotic arm with seven joints. The former represents an agent in line-following and object-pushing scenarios, where variation of visual and structural parameters yields a large number of distinct tasks, whereas the latter is used in two goal-reaching scenarios: one with high-level Cartesian hand position control (modeled after the Continual World benchmark) and one with low-level control based on joint angles. For the robotic arm benchmarks, we provide additional kinematics-only variants that bypass the need for physical simulation (as long as no sensor readings are required) and that run two orders of magnitude faster. CRoSS is designed to be easily extensible and enables controlled studies of continual reinforcement learning in robotic settings with high physical realism; in particular, it allows the use of almost arbitrary simulated sensors. To ensure reproducibility and ease of use, we provide a containerized setup (Apptainer) that runs out of the box, and we report the performance of standard RL algorithms, including Deep Q-Networks (DQN) and policy gradient methods. These results highlight the suitability of CRoSS as a scalable and reproducible benchmark for CRL research.

Paper Structure (39 sections, 5 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: Simulated robots. Left: two-wheeled $3\pi$ robot. Lidar sensors and camera objects are just visual placeholders to indicate their positions. Right: 7-d.o.f. Franka Emika Panda arm.
  • Figure 2: Benchmarks for the two-wheeled robot used in this study. Left: multi-task pushing objects (MPO), the robot is approaching geometrical objects with different shapes, colors, and symbols projected onto their faces. Right: multi-task line following (MLF), the robot drives on a ground plane covered with colored lines and limited by a wall. It is supposed to follow the slim centered lines. The robot is the same in both benchmarks but uses a downwards-looking line camera in the MLF benchmark, whereas a forward-looking RGB camera is used in the MPO benchmark.
  • Figure 3: Sensor data used for learning in the MLF and MPO benchmarks. Left: two default-setting 20x20 RGB images and two simplified 15x3 mono images from the MPO benchmark. Simplified inputs are population-coded, meaning that the position of the single bright pixel indicates a sensor reading. The uppermost row is split into three population codes for color, shape and symbol of the currently approached object, whereas the next rows encode object distance and viewing angle. Right: two default-setting 100x2 composite images and two simplified 18x3 images from the MLF benchmark. Composite images integrate line camera data (upper row) and lidar distance to wall (lower row), the latter again population-coded. Simplified images contain three population codes for left/middle/right line color (upper row), a population code for distance to wall (middle row), and a population code for the centering of the left border of the middle line in the image. The simplified inputs make the problems easier to learn without sacrificing their massively continual character.
  • Figure 4: Average task accuracy across sequentially introduced tasks in the HLR benchmark with three different buffer sizes. As new tasks are added during training, performance on earlier tasks declines, illustrating the effect of catastrophic forgetting.
  • Figure 5: Average task accuracy across sequentially introduced tasks in the low-level reaching benchmark with a buffer size of 10000. As new tasks are added during training, performance on earlier tasks declines, illustrating the effect of catastrophic forgetting.
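
The population coding described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the CRoSS implementation: the function names, the value ranges, the assumption of six color classes per line, and the placement of the centering code in the third row are all hypothetical choices, made only to reproduce the 18x3 layout of the simplified MLF input, in which the position of a single bright pixel encodes each reading.

```python
def population_code(value, low, high, width):
    """Encode a scalar as a row with one bright pixel whose position
    grows linearly with the value (clipped to [low, high])."""
    if high <= low or width < 1:
        raise ValueError("need high > low and width >= 1")
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    idx = min(int(frac * width), width - 1)  # keep value == high in range
    row = [0.0] * width
    row[idx] = 1.0
    return row


def categorical_code(index, n_classes):
    """Encode a class index (e.g. a line color) as a one-hot row segment."""
    if not 0 <= index < n_classes:
        raise ValueError("class index out of range")
    row = [0.0] * n_classes
    row[index] = 1.0
    return row


def simplified_mlf_image(left, middle, right, wall_distance, centering,
                         n_colors=6, max_distance=1.0):
    """Build a 3-row image in the spirit of the simplified MLF input.

    Assumptions (not taken from the paper): 6 color classes per line,
    wall distance in [0, max_distance], centering in [-1, 1],
    and the centering code stored in the third row.
    """
    width = 3 * n_colors  # 18 pixels per row when n_colors == 6
    top = (categorical_code(left, n_colors)
           + categorical_code(middle, n_colors)
           + categorical_code(right, n_colors))
    mid = population_code(wall_distance, 0.0, max_distance, width)
    bottom = population_code(centering, -1.0, 1.0, width)
    return [top, mid, bottom]


if __name__ == "__main__":
    image = simplified_mlf_image(left=0, middle=2, right=5,
                                 wall_distance=0.5, centering=0.0)
    for row in image:
        print("".join("#" if pixel else "." for pixel in row))
```

Each row stays sparse regardless of the underlying sensor value, so changing a task parameter such as a line color moves a bright pixel rather than altering the input dimensionality.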