Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale

Tobias Jülg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter

TL;DR

This work introduces Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies, and evaluates its usability and performance along the development cycle of VLA and RL policies.

Abstract

Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets, weights, and videos are available at: https://robotcontrolstack.github.io/

Paper Structure

This paper contains 30 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The layered architecture of RCS: Applications at the top layer access robots, sensors, and actuators through a Gymnasium-based Python API for easy switching between hardware and a MuJoCo simulation. The lower layers expose a C++ API for performance-critical features, making RCS equally suitable for end-to-end policy learning and low-level control.
  • Figure 2: The architecture of RCS. The applications on the left side interface with the environment on the right side, which can be a simulation or a real robot, through the Gymnasium interface. Sensors, actuators, and data observers wrap the environment by mutating the action space, the observation space, or both.
  • Figure 3: We deployed RCS in four different setups. From top left to bottom right: FR3 with Franka Hand gripper, wrist and side cameras; xArm7 with Tilburg Hand and one side camera; UR5e with Robotiq gripper, wrist and side cameras; SO101 with wrist and side cameras.
  • Figure 4: Configured control vs. measured data frequency during teleoperation, averaged over more than 1000 steps. The shaded area denotes the standard deviation. FR3 2 Cams: Two RealSense cameras. FR3 4+2 Cams: Four RealSense cameras and two DIGIT sensors. Dual FR3 4+2 Cams: Like FR3 4+2 Cams but with two FR3 robots.
  • Figure 5: Replicated simulation scene. Left: The simulation scene of FR3, with calibrated camera poses visible in light green. Right: Images from the wrist and side cameras in real setup (above) vs. in simulation (below).
  • ...and 2 more figures
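
The wrapper pattern described in the Figure 2 caption can be illustrated with a minimal, dependency-free Python sketch. It only mirrors the Gymnasium `reset`/`step` calling convention and is not the RCS API: `FakeSimArm`, `Wrapper`, `GripperWrapper`, `CameraWrapper`, and all dictionary keys are hypothetical names chosen for illustration. The point is that an actuator wrapper extends the action space, a sensor wrapper extends the observation space, and the application code stays the same whichever backend sits at the bottom.

```python
from typing import Any, Dict, List, Tuple

Obs = Dict[str, Any]
StepResult = Tuple[Obs, float, bool, bool, Dict[str, Any]]


class FakeSimArm:
    """Hypothetical stand-in for a robot backend (not an RCS class)."""

    def __init__(self) -> None:
        self.joints: List[float] = [0.0, 0.0, 0.0]

    def reset(self) -> Tuple[Obs, Dict[str, Any]]:
        self.joints = [0.0, 0.0, 0.0]
        return {"joints": list(self.joints)}, {}

    def step(self, action: Dict[str, Any]) -> StepResult:
        # Interpret the action as per-joint position deltas.
        self.joints = [q + dq for q, dq in zip(self.joints, action["joint_delta"])]
        return {"joints": list(self.joints)}, 0.0, False, False, {}


class Wrapper:
    """Pass-through base class; subclasses mutate actions and/or observations."""

    def __init__(self, env: Any) -> None:
        self.env = env

    def reset(self) -> Tuple[Obs, Dict[str, Any]]:
        return self.env.reset()

    def step(self, action: Dict[str, Any]) -> StepResult:
        return self.env.step(action)


class GripperWrapper(Wrapper):
    """Actuator wrapper: adds a 'gripper' action key and reports its state."""

    def __init__(self, env: Any) -> None:
        super().__init__(env)
        self.width = 1.0  # 1.0 = fully open (assumed convention)

    def reset(self) -> Tuple[Obs, Dict[str, Any]]:
        obs, info = self.env.reset()
        self.width = 1.0
        obs["gripper"] = self.width
        return obs, info

    def step(self, action: Dict[str, Any]) -> StepResult:
        action = dict(action)  # do not modify the caller's dictionary
        # Consume the gripper command so the inner environment never sees it.
        self.width = float(action.pop("gripper", self.width))
        obs, reward, terminated, truncated, info = self.env.step(action)
        obs["gripper"] = self.width
        return obs, reward, terminated, truncated, info


class CameraWrapper(Wrapper):
    """Sensor wrapper: adds a placeholder image under the given camera name."""

    def __init__(self, env: Any, name: str) -> None:
        super().__init__(env)
        self.name = name

    def _frame(self) -> List[List[int]]:
        return [[0] * 4 for _ in range(4)]  # dummy 4x4 grayscale image

    def reset(self) -> Tuple[Obs, Dict[str, Any]]:
        obs, info = self.env.reset()
        obs[self.name] = self._frame()
        return obs, info

    def step(self, action: Dict[str, Any]) -> StepResult:
        obs, reward, terminated, truncated, info = self.env.step(action)
        obs[self.name] = self._frame()
        return obs, reward, terminated, truncated, info


# Application code only sees reset/step, regardless of the backend below.
env = CameraWrapper(GripperWrapper(FakeSimArm()), name="wrist")
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(
    {"joint_delta": [0.1, 0.0, -0.1], "gripper": 0.0}
)
```

In this sketch, replacing `FakeSimArm` with a class that talks to hardware would leave the wrappers and the application loop untouched, which is the property the unified interface is meant to provide.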