Learning Vision-Driven Reactive Soccer Skills for Humanoid Robots

Yushi Wang, Changsheng Luo, Penghui Chen, Jianran Liu, Weijian Sun, Tong Guo, Kechang Yang, Biao Hu, Yangang Zhang, Mingguo Zhao

TL;DR

This work addresses real-time, vision-driven reactive control for humanoid soccer by unifying perception and motion in a single reinforcement learning framework. It extends Adversarial Motion Priors to perceptual settings, adding an encoder–decoder latent representation and a virtual perception system to narrow the sim-to-real gap, which enables active perception and robust ball tracking. The result is a single policy that walks, chases, and kicks across varied environments, transfers zero-shot from simulation, and performs well in real RoboCup matches. The study is positioned as a practical step for embodied intelligence in unstructured domains and points to multi-agent extensions for team-based strategies as future work.

Abstract

Humanoid soccer poses a representative challenge for embodied intelligence, requiring robots to operate within a tightly coupled perception-action loop. However, existing systems typically rely on decoupled modules, resulting in delayed responses and incoherent behaviors in dynamic environments, while real-world perceptual limitations further exacerbate these issues. In this work, we present a unified reinforcement learning-based controller that enables humanoid robots to acquire reactive soccer skills through the direct integration of visual perception and motion control. Our approach extends Adversarial Motion Priors to perceptual settings in real-world dynamic environments, bridging motion imitation and visually grounded dynamic control. We introduce an encoder-decoder architecture combined with a virtual perception system that models real-world visual characteristics, allowing the policy to recover privileged states from imperfect observations and establish active coordination between perception and action. The resulting controller demonstrates strong reactivity, consistently executing coherent and robust soccer behaviors across various scenarios, including real RoboCup matches.
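As a rough illustration of what a virtual perception system of this kind might do in simulation, the sketch below turns a ground-truth ball position into an imperfect observation: the ball is reported only when it lies inside the camera's horizontal field of view, detections are randomly dropped, and the reported position is perturbed by distance-dependent noise. This is not the authors' implementation. The function name `virtual_ball_observation`, the 90-degree FOV, and the noise and dropout values are all assumptions chosen for the example.

```python
import math
import random


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def virtual_ball_observation(ball_xy, camera_yaw, fov_deg=90.0,
                             noise_std=0.05, drop_prob=0.1, rng=None):
    """Degrade a ground-truth ball position into a camera-like observation.

    ball_xy    : true ball position in the robot frame (x forward, y left), metres
    camera_yaw : camera heading relative to the robot's forward axis, radians
    fov_deg, noise_std, drop_prob : hypothetical values, not taken from the paper
    Returns (visible, observed_xy); observed_xy is (0.0, 0.0) when not visible.
    """
    rng = rng or random.Random()
    x, y = ball_xy

    # Angular distance between the ball's bearing and the camera centre.
    offset = wrap_angle(math.atan2(y, x) - camera_yaw)
    in_fov = abs(offset) <= math.radians(fov_deg) / 2.0

    # Outside the FOV, or a simulated missed detection: no observation.
    if not in_fov or rng.random() < drop_prob:
        return False, (0.0, 0.0)

    # Assumed noise model: standard deviation grows linearly with distance.
    sigma = noise_std * math.hypot(x, y)
    return True, (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))


if __name__ == "__main__":
    rng = random.Random(0)
    print(virtual_ball_observation((2.0, 0.5), camera_yaw=0.0, rng=rng))
    print(virtual_ball_observation((-1.0, 0.0), camera_yaw=0.0, rng=rng))
```

Training against observations like these, rather than the true state, is what forces a policy to move its camera toward the ball and to maintain its own estimate while the ball is out of view.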

Paper Structure

This paper contains 19 sections, 6 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: System overview. The real-world robot is equipped with an onboard camera for visual perception. Image detections are projected into the BEV space. Ball detections are provided directly to the policy, while field landmarks are processed by an odometer module to infer the goal location from long-term information. The perception pipeline is designed to efficiently extract and represent visual features for the RL policy.
  • Figure 2: Performance of the controller in various scenarios. (A to F) Real match performance in cluttered environments with disturbances. (G to I) Reactive responses and real-time adaptation to the ball. (J to L) Robust behavior across varying terrain and visually diverse environments.
  • Figure 3: Validation and behavior analysis. (A) The background grid color represents the success rate of 8192 simulation tests, while the dots indicate the success rate of 10 consecutive hardware tests. Owing to effective alignment, our policy delivers reliable hardware performance that closely matches simulation results. (B) The robot searches for the ball in the distance when starting near the field edge, guided by the policy’s estimated ball position. (C) The robot turns to search for the ball behind itself when it is near the field center. (D and E) The robot’s foothold locations and timing reveal adaptive gait with shorter strides and faster cadence, enabling effective adjustment before kicking, as illustrated by the forward and backward kick examples.
  • Figure 4: Perception-action coordination. (A) Training curves for different methods, reporting overall success rates in disturbed training environments. (B) Proportion of ball perception, average perception error, and policy's ball position estimation error across 4096 kicking tests. (C) Distribution of angular distance between the ball and the camera center, evaluated over 1000 steps across 2048 environments. The shaded area indicates the camera's FOV. (D and E) The policy's estimation predicts the ball's movement when the ball is removed from the robot's FOV, guiding the robot to re-acquire visual detection of the ball in the direction it disappeared.
  • Figure 5: Versatile gait behaviors. Visualization of 20,000 collected joint-space trajectory frames together with the reference motion dataset reduced to a 2D plane using UMAP. Five distinct clusters highlight the policy's ability to integrate reference motions and generalize beyond them to generate task-specific behaviors. From each major cluster, a representative trajectory is selected to illustrate the corresponding behavior.
  • ...and 4 more figures
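The Figure 1 caption mentions projecting image detections into BEV space. One common way to do this, shown here only as a minimal sketch, is to cast a ray through the detected pixel with a pinhole camera model and intersect it with a flat ground plane. The setup is assumed rather than taken from the paper: the camera sits at height `cam_height` above the ground, is pitched down by `pitch` with no roll or yaw, and has hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. The function name `pixel_to_bev` is also made up for this example.

```python
import math


def pixel_to_bev(u, v, fx, fy, cx, cy, cam_height, pitch):
    """Project a pixel onto the ground plane in the robot frame.

    Robot frame: x forward, y left, z up; the ground is the plane z = 0.
    Camera frame: x right, y down, z forward (optical axis).
    pitch is the downward tilt of the optical axis in radians.
    Returns (x, y) in metres, or None if the ray never reaches the ground.
    """
    # Normalised image coordinates of the pixel (pinhole model, no distortion).
    xn = (u - cx) / fx
    yn = (v - cy) / fy

    s, c = math.sin(pitch), math.cos(pitch)

    # Ray direction in the robot frame, built from the camera axes:
    #   right = (0, -1, 0), down = (-s, 0, -c), forward = (c, 0, -s)
    dx = c - yn * s
    dy = -xn
    dz = -s - yn * c

    # The ray must point downward to intersect the ground.
    if dz >= -1e-9:
        return None

    # Scale the ray so it descends exactly cam_height.
    t = cam_height / -dz
    return (t * dx, t * dy)


if __name__ == "__main__":
    # Image centre with the camera 1 m up and tilted 45 degrees down:
    # the ground point lies about 1 m straight ahead.
    print(pixel_to_bev(320, 240, 500.0, 500.0, 320.0, 240.0, 1.0, math.radians(45)))
```

A real system would also account for head pan, body roll, and lens distortion. The flat-ground assumption is the main reason this kind of projection gets less accurate for distant detections near the horizon.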