Robo-DM: Data Management For Large Robot Datasets

Kaiyuan Chen, Letian Fu, David Huang, Yanxiang Zhang, Lawrence Yunliang Chen, Huang Huang, Kush Hari, Ashwin Balakrishna, Ted Xiao, Pannag R Sanketi, John Kubiatowicz, Ken Goldberg

TL;DR

Robo-DM introduces a unified EBML-based container for multi-modal robot data (vision, language, action) to address the storage, transmission, and loading bottlenecks of large teleoperated datasets. The framework combines self-contained data storage, flexible lossy and lossless video compression, memory-mapped caching, and load-balancing to deliver dramatic data-size reductions (up to ~70x lossy, ~3.5x lossless) and faster loading compared to prior formats. Empirical evaluations on Open-X-Embodiment show substantial throughput improvements and minimal degradation in downstream tasks, with case studies including fine-tuning Octo and In-Context Robot Transformer training demonstrating practical utility. Overall, Robo-DM enables scalable, cost-efficient training of large robotic models by streamlining data collection, management, and integration with existing ML pipelines and ROS2 tooling.

Abstract

Recent results suggest that very large datasets of teleoperated robot demonstrations can be used to train transformer-based models that have the potential to generalize to new scenes, robots, and tasks. However, curating, distributing, and loading large datasets of robot trajectories, which typically consist of video, textual, and numerical modalities (including streams from multiple cameras), remains challenging. We propose Robo-DM, an efficient open-source cloud-based data management toolkit for collecting, sharing, and learning with robot data. With Robo-DM, robot datasets are stored in a self-contained format with Extensible Binary Meta Language (EBML). Robo-DM can significantly reduce the size of robot trajectory data, transfer costs, and data load time during training. Compared to the RLDS format used in OXE datasets, Robo-DM's compression saves space by up to 70x (lossy) and 3.5x (lossless). Robo-DM also accelerates data retrieval by load-balancing video decoding with memory-mapped decoding caches. Compared to LeRobot, a framework that also uses lossy video compression, Robo-DM is up to 50x faster when decoding sequentially. We physically evaluate a model trained with Robo-DM under lossy compression, using a pick-and-place task and In-Context Robot Transformer. In this evaluation, Robo-DM compresses the original dataset by 75x without reducing downstream task accuracy.
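To make the idea of a memory-mapped decoding cache concrete, here is a minimal sketch, not Robo-DM's actual implementation or API. It assumes decoded frames have a fixed, known shape, and the names `load_frames_cached` and `fake_decode` are hypothetical. The first load pays the decoding cost and writes the frames to a file. Later loads map that file into memory instead of decoding again.

```python
import os
import tempfile

import numpy as np


def load_frames_cached(cache_path, decode_fn, shape, dtype=np.uint8):
    """Return decoded frames, calling decode_fn only when no cache file exists.

    Hypothetical helper for illustration; Robo-DM's real cache may differ.
    """
    if os.path.exists(cache_path):
        # Cache hit: map the file read-only, no decoding needed.
        return np.memmap(cache_path, dtype=dtype, mode="r", shape=shape)

    # Cache miss: decode once, then persist the result as a raw array file.
    frames = np.asarray(decode_fn(), dtype=dtype).reshape(shape)
    cache = np.memmap(cache_path, dtype=dtype, mode="w+", shape=shape)
    cache[:] = frames
    cache.flush()
    return cache


# Stand-in for an expensive video decoder (assumption: 2 RGB frames of 4x4).
calls = {"n": 0}
shape = (2, 4, 4, 3)


def fake_decode():
    calls["n"] += 1
    return np.arange(np.prod(shape), dtype=np.uint8).reshape(shape)


cache_file = os.path.join(tempfile.mkdtemp(), "episode_0.cache")
first = load_frames_cached(cache_file, fake_decode, shape)   # decodes
second = load_frames_cached(cache_file, fake_decode, shape)  # reads the cache
print("decoder calls:", calls["n"])
```

A production cache would also need to handle concurrent writers and invalidation, which this sketch leaves out.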

Paper Structure

This paper contains 13 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Robo-DM can streamline robot data collection, management, and learning. (B) Robo-DM uses a unified format for vision, language, and action that does not rely on assumptions about timestamps and data, and supports plug-and-play data collection to integrate with existing setups. (C) Robo-DM can facilitate replay and visualization. (D) Existing training frameworks can load from Robo-DM efficiently with minimal modification.
  • Figure 2: A File Structure Comparison of RLDS, LeRobot and Robo-DM. All formats include metadata, storing descriptive information such as authors and dataset summary. (A) Reinforcement Learning Dataset (RLDS) stores episodes in partitions, where each partition is a TensorFlow Dataset Record file. All streams in episode data are compressed matrices that can be directly loaded and trained in TensorFlow. (B) LeRobot combines three formats for robot data. For vision data, it uses one MP4 per video stream in an episode, and uses HuggingFace Dataset (with Apache Arrow as backend) to store language and action streams and the path to the MP4 files. It also uses safetensors to store episode information. All the streams are scattered: to extract an episode, the framework needs to query safetensors for episode information, which is used to find the rest of the non-video streams in the HuggingFace Dataset, and finally use the frame information from the HuggingFace Dataset to find the corresponding MP4 files for vision streams. (C) In Robo-DM, robot data in all the episodes are stored and aligned in a self-contained format. To load an episode, one simply reads from Robo-DM files and loads the data as trainable matrices.
  • Figure 3: How Robo-DM stores an episode of robot data with vision, language and action data. Robo-DM encodes vision, language and action data. For vision data, Robo-DM uses video or image compression; language and action data are serialized into bytes. All the bytes are encapsulated with an intake timestamp. Then Robo-DM multiplexes the different streams of data into a self-describing EBML file format.
  • Figure 4: Episode Per Second Throughput of Robo-DM on Three OXE datasets with Different Characteristics. We compare Robo-DM with the baseline data loading methods RLDS, HDF5 and LeRobot. Complete episodes are loaded concurrently as a batch, and we record the average throughput over 200 batches.
  • Figure 5: Concurrent Loading Latency with respect to Episode Size of Robo-DM. We compare Robo-DM with the baseline data loading methods RLDS, HDF5 and LeRobot. Complete episodes are loaded concurrently as a batch, and we record the average latency of 200 batches with a batch size of 8 episodes. We use the lowest GCP cost of 0.02 US Dollars (USD) per GB.
  • ...and 1 more figures
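
The storage flow described in the Figure 3 caption (serialize each stream to bytes, attach an intake timestamp, multiplex into one file) can be sketched as follows. This is only an illustration under stated assumptions: it uses a simple length-prefixed packet layout as a stand-in for EBML, and the stream ids, the `HEADER` layout, and the `mux`/`demux` helpers are hypothetical rather than part of Robo-DM.

```python
import heapq
import io
import json
import struct

# Hypothetical packet header: stream id (u8), timestamp in ms (u64),
# payload length in bytes (u32). Real EBML uses variable-length element ids.
HEADER = struct.Struct("<BQI")

VISION, LANGUAGE, ACTION = 0, 1, 2  # illustrative stream ids


def mux(streams):
    """Interleave per-stream (timestamp_ms, payload) lists into one byte blob.

    Each input list must already be sorted by timestamp.
    """
    tagged = (
        [(ts, sid, payload) for ts, payload in packets]
        for sid, packets in streams.items()
    )
    out = io.BytesIO()
    # heapq.merge yields packets from all streams in global timestamp order.
    for ts, sid, payload in heapq.merge(*tagged):
        out.write(HEADER.pack(sid, ts, len(payload)))
        out.write(payload)
    return out.getvalue()


def demux(blob):
    """Recover the per-stream packet lists from a multiplexed blob."""
    streams = {}
    offset = 0
    while offset < len(blob):
        sid, ts, size = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        streams.setdefault(sid, []).append((ts, blob[offset:offset + size]))
        offset += size
    return streams


# Toy episode: the vision payloads stand in for compressed video frames.
episode = {
    VISION: [(0, b"frame0"), (33, b"frame1")],
    LANGUAGE: [(0, "pick up the cup".encode("utf-8"))],
    ACTION: [(10, json.dumps([0.1, 0.2]).encode("utf-8")),
             (40, json.dumps([0.3, 0.4]).encode("utf-8"))],
}

blob = mux(episode)
restored = demux(blob)
print(len(blob), "bytes;", "round trip ok:", restored == episode)
```

Because every packet carries its own stream id, timestamp, and length, a reader can recover all streams from the single file without consulting any external index, which is the self-contained property the caption describes.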