Learning Robot Soccer from Egocentric Vision with Deep Reinforcement Learning

Dhruva Tirumala, Markus Wulfmeier, Ben Moran, Sandy Huang, Jan Humplik, Guy Lever, Tuomas Haarnoja, Leonard Hasenclever, Arunkumar Byravan, Nathan Batchelor, Neil Sreendra, Kushal Patel, Marlon Gwira, Francesco Nori, Martin Riedmiller, Nicolas Heess

TL;DR

The paper addresses end-to-end, vision-based multi-agent robot soccer with onboard sensing and partial observability, proposing a two-stage RL framework and zero-shot sim-to-real transfer using NeRF-based rendering. It combines memory-enabled policy learning, Replay across Experiments for data efficiency, and adaptive KL-regularization to distill expert skills into a single robust agent. Empirical results show emergent active perception behaviors, robust ball tracking, and agility on par with state-based policies in simulation, with real-world transfer demonstrated on humanoid robots, albeit with some performance drop due to real-world noise. The work highlights NeRF-based realistic rendering and diverse visual domain randomization as essential for sim-to-real success, and analyzes data-source strategies to favor vision-based learning for complex, long-horizon tasks.

Abstract

We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision that can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors, including object tracking and ball seeking, that emerge when simply optimizing perception-agnostic soccer play. The agents display levels of performance and agility equivalent to those of policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the gameplay and analyses can be seen on our website: https://sites.google.com/view/vision-soccer
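
The cross-experiment data reuse mentioned above can be illustrated with a minimal sketch of batch construction that mixes freshly collected transitions with transitions saved from earlier experiments. This is not the authors' implementation: the function name `sample_mixed_batch`, the `prior_fraction` parameter, and the uniform sampling with replacement are all assumptions made for illustration.

```python
import random


def sample_mixed_batch(online, prior, batch_size, prior_fraction=0.5, rng=None):
    """Draw one training batch from two transition pools.

    online: transitions collected in the current experiment.
    prior:  transitions saved from earlier experiments (may be empty).
    prior_fraction: hypothetical mixing ratio; not taken from the paper.
    """
    if not online:
        raise ValueError("online buffer must contain at least one transition")
    rng = rng or random.Random()
    # Use prior data only if some is available.
    n_prior = int(batch_size * prior_fraction) if prior else 0
    n_online = batch_size - n_prior
    # Uniform sampling with replacement from each pool (an assumption).
    batch = rng.choices(prior, k=n_prior) if n_prior else []
    batch += rng.choices(online, k=n_online)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    online = [("online", i) for i in range(10)]
    prior = [("prior", i) for i in range(10)]
    batch = sample_mixed_batch(online, prior, batch_size=8, prior_fraction=0.25, rng=rng)
    print(sum(1 for source, _ in batch if source == "prior"), "of", len(batch), "from prior data")
```

An off-policy learner can consume such a batch directly, since it does not require the data to come from the current policy; the mixing ratio is a tunable choice in this sketch.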

Paper Structure

This paper contains 30 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Method Overview. In each experiment, we train the agent with off-policy RL via Replay across Experiments (RaE; Tirumala et al., 2024) from a mixture of current experience and previously collected data. We additionally regularize the agent in later training stages towards teacher policies. The agent is trained purely in simulation, and transferred zero-shot to the physical world. The agent only perceives onboard observations, consisting of RGB images and proprioception.
  • Figure 2: Emergent behaviors. Each row shows a different emergent behavior. The agent's camera view is at the top right of each frame. Top row: The agent pivots to scan the scene and localize the ball, walks towards it, and positions itself to shoot. Middle: Agents scramble for the ball and try to shoot past each other. Bottom: The agent positions itself to prevent the opponent from scoring.
  • Figure 3: Shooting behavior, real-world: The agent smoothly transitions between skills and scores.
  • Figure 4: Position prediction results: Each row shows the predicted positions of either the agent, opponent, or ball across time. Each frame shows the simulation with the egocentric camera view in the top right and the predicted quantity in the bottom right. The heatmap of the Gaussian mixture model converges to a point as the certainty increases. Ground truth is indicated by a black cross.
  • Figure 5: Position prediction results, real world: Each column displays the egocentric camera view (top), predicted agent (middle) and ball (bottom) position from the same instant in time. Initially the agent is uncertain about both its own position and the ball's position. After finding the ball the predictions become more certain and are relatively accurate. After kicking the ball (column 4), the agent continues to accurately predict the ball location even when the ball is no longer in view.
  • ...and 4 more figures
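
The Figure 4 caption describes a Gaussian mixture heatmap that converges to a point as certainty increases. A minimal sketch of that effect follows, assuming isotropic 2D components over field coordinates; the parameterization, the function names `gmm_density` and `heatmap`, and the example numbers are hypothetical and not taken from the paper.

```python
import math


def gmm_density(x, y, components):
    """Density of an isotropic 2D Gaussian mixture at (x, y).

    components: list of (weight, mean_x, mean_y, sigma) tuples whose
    weights are assumed to sum to 1.
    """
    total = 0.0
    for weight, mean_x, mean_y, sigma in components:
        sq_dist = (x - mean_x) ** 2 + (y - mean_y) ** 2
        norm = 2.0 * math.pi * sigma ** 2
        total += weight * math.exp(-0.5 * sq_dist / sigma ** 2) / norm
    return total


def heatmap(components, xs, ys):
    """Evaluate the mixture on a grid; rows follow ys, columns follow xs."""
    return [[gmm_density(x, y, components) for x in xs] for y in ys]


if __name__ == "__main__":
    grid = [i / 10.0 for i in range(-20, 21)]  # coordinates from -2.0 to 2.0
    # Early in an episode: two broad, competing hypotheses.
    uncertain = [(0.5, -1.0, 0.0, 0.8), (0.5, 1.0, 0.0, 0.8)]
    # Later: a single narrow component around the estimated position.
    certain = [(1.0, 1.0, 0.0, 0.1)]
    for name, comps in (("uncertain", uncertain), ("certain", certain)):
        peak = max(max(row) for row in heatmap(comps, grid, grid))
        print(name, "peak density:", round(peak, 3))
```

As the component widths shrink, the probability mass concentrates and the peak density rises, which is what a heatmap collapsing to a point would look like.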