Vision-Proprioception Fusion with Mamba2 in End-to-End Reinforcement Learning for Motion Control
Xiaowen Tao, Yinuo Wang, Jinzhao Zhou
TL;DR
This work introduces SSD-Mamba2 as a vision–proprioception fusion backbone for end-to-end reinforcement learning in quadrupedal motion control. Proprioceptive states are embedded via an MLP and depth images are tokenized by a lightweight CNN; the resulting tokens are fused by stacked SSD-Mamba2 layers, which scale near-linearly with sequence length and support long-horizon modeling. Trained with PPO under domain randomization and an obstacle-density curriculum, the approach achieves higher returns, fewer collisions, and longer travel distances than proprioception-only and Transformer-based baselines, while converging faster under equal compute budgets. The results indicate strong potential for safe, real-time robotic control on resource-constrained hardware and provide a path toward robust sim-to-real transfer and broader applications in safety-critical motion control.
Abstract
End-to-end reinforcement learning (RL) for motion control trains policies directly from sensor inputs to motor commands, enabling unified controllers for different robots and tasks. However, most existing methods are either blind (proprioception-only) or rely on fusion backbones with unfavorable compute-memory trade-offs. Recurrent controllers struggle with long-horizon credit assignment, and Transformer-based fusion incurs quadratic cost in token length, limiting temporal and spatial context. We present a vision-driven cross-modal RL framework built on SSD-Mamba2, a selective state-space backbone that applies state-space duality (SSD) to enable both recurrent and convolutional scanning with hardware-aware streaming and near-linear scaling. Proprioceptive states and exteroceptive observations (e.g., depth tokens) are encoded into compact tokens and fused by stacked SSD-Mamba2 layers. The selective state-space updates retain long-range dependencies with markedly lower latency and memory use than quadratic self-attention, enabling longer look-ahead, higher token resolution, and stable training under limited compute. Policies are trained end-to-end under curricula that randomize terrain and appearance and progressively increase scene complexity. A compact, state-centric reward balances task progress, energy efficiency, and safety. Across diverse motion-control scenarios, our approach consistently surpasses strong state-of-the-art baselines in return, safety (collisions and falls), and sample efficiency, while converging faster at the same compute budget. These results suggest that SSD-Mamba2 provides a practical fusion backbone for resource-constrained robotic and autonomous systems in engineering informatics applications.
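To make the fusion step concrete, the following is a minimal NumPy sketch of a single selective state-space layer applied to a fused token sequence. It is an illustration under stated assumptions, not the implementation used in this work: the token dimensions, the random projection weights, the patch-based depth tokenizer (a stand-in for the lightweight CNN), and the function names `make_tokens`, `selective_params`, `ssd_recurrent`, and `ssd_quadratic` are all hypothetical. The sketch uses a scalar, input-dependent decay per token and computes the same output two ways: a linear-time recurrent scan and an equivalent masked matrix form, which is the kind of equivalence that state-space duality exploits.

```python
import numpy as np

def make_tokens(proprio, depth, rng, d=16, patch=8):
    """Build a token sequence: one proprioceptive token plus depth-patch tokens.
    The random linear maps are stand-ins (assumptions) for the MLP and CNN encoders."""
    W_p = rng.normal(scale=0.1, size=(proprio.shape[0], d))
    p_tok = np.tanh(proprio @ W_p)[None, :]                      # (1, d)
    H, W = depth.shape
    patches = (depth.reshape(H // patch, patch, W // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))                 # (num_patches, patch*patch)
    W_d = rng.normal(scale=0.1, size=(patch * patch, d))
    return np.vstack([p_tok, patches @ W_d])                     # (1 + num_patches, d)

def selective_params(x, rng, n=4):
    """Input-dependent ("selective") decay a_t in (0, 1) and projections B_t, C_t."""
    d = x.shape[1]
    w_dt = rng.normal(scale=0.5, size=d)
    W_B = rng.normal(scale=0.5, size=(d, n))
    W_C = rng.normal(scale=0.5, size=(d, n))
    softplus = np.log1p(np.exp(x @ w_dt))                        # positive step size per token
    return np.exp(-softplus), x @ W_B, x @ W_C                   # a: (T,), B: (T, n), C: (T, n)

def ssd_recurrent(x, a, B, C):
    """Linear-time scan: h_t = a_t * h_{t-1} + B_t x_t^T,  y_t = C_t^T h_t."""
    T, d = x.shape
    h = np.zeros((B.shape[1], d))
    y = np.empty_like(x)
    for t in range(T):
        h = a[t] * h + np.outer(B[t], x[t])
        y[t] = C[t] @ h
    return y

def ssd_quadratic(x, a, B, C):
    """Equivalent masked matrix form: y = (L * (C B^T)) x,
    where L[i, j] = a_{j+1} * ... * a_i for i >= j and 0 otherwise."""
    cum = np.cumsum(np.log(a))
    L = np.tril(np.exp(cum[:, None] - cum[None, :]))
    return (L * (C @ B.T)) @ x

rng = np.random.default_rng(0)
tokens = make_tokens(rng.normal(size=48), rng.random((32, 32)), rng)  # 1 + 16 tokens
a, B, C = selective_params(tokens, rng)
y_rec = ssd_recurrent(tokens, a, B, C)
y_quad = ssd_quadratic(tokens, a, B, C)
print(tokens.shape, np.allclose(y_rec, y_quad))
```

The recurrent form keeps a fixed-size state and touches each token once, so its cost grows linearly with the number of tokens, whereas the matrix form materializes a T-by-T mask, as self-attention does. A production layer would combine the two through a chunked, hardware-aware scan; that part is omitted here.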
