
RLinf-USER: A Unified and Extensible System for Real-World Online Policy Learning in Embodied AI

Hongzhi Zang, Shu'ang Yu, Hao Lin, Tianxing Zhou, Zefang Huang, Zhen Guo, Xin Xu, Jiakai Zhou, Yuze Sheng, Shizhe Zhang, Feng Gao, Wenhao Tang, Yufeng Yue, Quanlu Zhang, Xinlei Chen, Chao Yu, Yu Wang

TL;DR

Real-world online policy learning in embodied AI is constrained as much by systems as by data: robots run in real time, hardware is heterogeneous, and training spans cloud and edge networks. The authors propose USER, a unified, extensible system that treats robots as first-class hardware, provides an adaptive cloud–edge communication plane, and implements a fully asynchronous pipeline with a persistent, cache-aware buffer, enabling long-horizon, data-efficient learning across diverse policies and large vision-language-action (VLA) models. Key contributions include a hardware abstraction layer (HAL) for automatic discovery and scheduling of heterogeneous robots, streaming-multiprocessor (SM)-aware NCCL weight synchronization to manage GPU contention, distributed data channels to localize traffic, and a persistent buffer that supports replay and recovery across sessions. Experiments demonstrate multi-robot and heterogeneous deployments, edge–cloud collaboration with large models, and substantial improvements in data throughput and learning stability, establishing a practical foundation for real-world online policy learning.

Abstract

Online policy learning directly in the physical world is a promising yet challenging direction for embodied intelligence. Unlike simulation, real-world systems cannot be arbitrarily accelerated, cheaply reset, or massively replicated, which makes scalable data collection, heterogeneous deployment, and effective long-horizon training difficult. These challenges suggest that real-world policy learning is not only an algorithmic issue but fundamentally a systems problem. We present USER, a Unified and extensible SystEm for Real-world online policy learning. USER treats physical robots as first-class hardware resources alongside GPUs through a unified hardware abstraction layer, enabling automatic discovery, management, and scheduling of heterogeneous robots. To address cloud-edge communication, USER introduces an adaptive communication plane with tunneling-based networking, distributed data channels for traffic localization, and streaming-multiprocessor-aware weight synchronization to regulate GPU-side overhead. On top of this infrastructure, USER organizes learning as a fully asynchronous framework with a persistent, cache-aware buffer, enabling efficient long-horizon experiments with robust crash recovery and reuse of historical data. In addition, USER provides extensible abstractions for rewards, algorithms, and policies, supporting online imitation or reinforcement learning of CNN/MLP policies, generative policies, and large vision-language-action (VLA) models within a unified pipeline. Results in both simulation and the real world show that USER supports multi-robot coordination, heterogeneous manipulators, edge-cloud collaboration with large models, and long-running asynchronous training, offering a unified and extensible systems foundation for real-world online policy learning.
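
To make the idea of robots as first-class hardware resources alongside GPUs more concrete, here is a minimal sketch of a registry that tracks both kinds of device through one interface and hands them out to workers. It is an illustration under stated assumptions, not USER's actual API: the names `Resource`, `HardwareRegistry`, `allocate`, and `release`, and the `model` tag values, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Resource:
    """A schedulable hardware unit (hypothetical); kind is "gpu" or "robot"."""
    name: str
    kind: str
    node: str                                   # host the device is attached to
    tags: Dict[str, str] = field(default_factory=dict)
    owner: Optional[str] = None                 # worker currently holding it


class HardwareRegistry:
    """Hypothetical registry that treats GPUs and robots uniformly."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def register(self, resource: Resource) -> None:
        if resource.name in self._resources:
            raise ValueError(f"duplicate resource: {resource.name}")
        self._resources[resource.name] = resource

    def free(self, kind: str, **tags: str) -> List[Resource]:
        """Return unallocated resources of `kind` whose tags all match."""
        return [
            r for r in self._resources.values()
            if r.kind == kind
            and r.owner is None
            and all(r.tags.get(k) == v for k, v in tags.items())
        ]

    def allocate(self, worker: str, kind: str, **tags: str) -> Resource:
        """Give the first matching free resource to `worker`."""
        candidates = self.free(kind, **tags)
        if not candidates:
            raise RuntimeError(f"no free {kind} matching {tags}")
        chosen = candidates[0]
        chosen.owner = worker
        return chosen

    def release(self, name: str) -> None:
        self._resources[name].owner = None


# Assumed setup: one cloud GPU and two different robot arms on edge nodes.
registry = HardwareRegistry()
registry.register(Resource("gpu0", "gpu", node="cloud-0"))
registry.register(Resource("arm0", "robot", node="edge-0", tags={"model": "type_a"}))
registry.register(Resource("arm1", "robot", node="edge-1", tags={"model": "type_b"}))

learner_gpu = registry.allocate("learner", "gpu")
env_robot = registry.allocate("env-worker-0", "robot", model="type_b")
```

The point of the sketch is that a scheduler written against `allocate` never needs to know whether it is placing work on a GPU or a manipulator; heterogeneity is expressed through tags rather than separate code paths.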

Paper Structure

This paper contains 33 sections, 11 equations, 15 figures, 14 tables.

Figures (15)

  • Figure 1: We propose USER, a Unified and extensible SystEm for Real-world online policy learning.
  • Figure 2: The system architecture design of USER.
  • Figure 3: Overview of learning framework design: a fully asynchronous real-world learning pipeline with a persistent, cache-aware buffer and extensible abstractions for policies, algorithms, and reward models.
  • Figure 4: Fully asynchronous pipeline. USER decouples data generation, training, data transmission, and weight synchronization, significantly improving both data collection and training throughput.
  • Figure 5: Persistent, cache-aware buffer. USER adopts a persistent, index-based buffer: recent data is kept in memory while historical data is persisted to disk, balancing access efficiency with storage capacity.
  • ...and 10 more figures
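
The buffer described in the Figure 5 caption can be sketched as follows. This is a simplified illustration under stated assumptions rather than USER's implementation: the class name `PersistentBuffer`, the one-pickle-file-per-item layout, the write-through policy, and the LRU eviction rule are all choices made here for brevity.

```python
import os
import pickle
import random
import tempfile
from collections import OrderedDict
from typing import Any, List


class PersistentBuffer:
    """Index-based buffer: every item is on disk, the most recent are in memory."""

    def __init__(self, root: str, cache_size: int) -> None:
        self.root = root
        self.cache_size = cache_size
        os.makedirs(root, exist_ok=True)
        self._cache: "OrderedDict[int, Any]" = OrderedDict()
        # Recover the next index from files left behind by an earlier session.
        existing = [int(f[:-4]) for f in os.listdir(root) if f.endswith(".pkl")]
        self._next = max(existing) + 1 if existing else 0

    def _path(self, idx: int) -> str:
        return os.path.join(self.root, f"{idx}.pkl")

    def _remember(self, idx: int, item: Any) -> None:
        # Keep at most `cache_size` items in memory, evicting least recently used.
        self._cache[idx] = item
        self._cache.move_to_end(idx)
        while len(self._cache) > self.cache_size:
            self._cache.popitem(last=False)

    def add(self, item: Any) -> int:
        idx = self._next
        self._next += 1
        with open(self._path(idx), "wb") as f:   # write-through, so a crash loses nothing
            pickle.dump(item, f)
        self._remember(idx, item)
        return idx

    def get(self, idx: int) -> Any:
        if idx in self._cache:                    # cache hit: no disk access
            self._cache.move_to_end(idx)
            return self._cache[idx]
        with open(self._path(idx), "rb") as f:    # cache miss: load historical data
            item = pickle.load(f)
        self._remember(idx, item)
        return item

    def __len__(self) -> int:
        return self._next

    def sample(self, batch_size: int, rng: random.Random) -> List[Any]:
        return [self.get(rng.randrange(self._next)) for _ in range(batch_size)]


root = tempfile.mkdtemp()
buf = PersistentBuffer(root, cache_size=2)
for step in range(5):
    buf.add({"step": step})
cached = sorted(buf._cache)                        # only the two newest indices

restored = PersistentBuffer(root, cache_size=2)    # simulates a new session
batch = restored.sample(3, random.Random(0))
```

Because items are addressed by integer index and written through to disk, a restarted process can rebuild the buffer from the directory alone, which is one simple way to obtain the crash recovery and reuse of historical data that the abstract describes. A production system would likely batch writes and store tensors in a more compact format than pickle.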