KiVi: Kinesthetic-Visuospatial Integration for Dynamic and Safe Egocentric Legged Locomotion
Peizhuo Li, Hongyi Li, Yuxuan Ma, Linnan Chang, Xinrong Yang, Ruiqi Yu, Yifeng Zhang, Yuhong Cao, Qiuguo Zhu, Guillaume Sartoretti
TL;DR
KiVi addresses the fragility of vision-based legged locomotion by explicitly separating proprioceptive and visual pathways and enriching their fusion with a memory-augmented transformer. The framework uses an asymmetric actor–critic with a Kinesthetic Module and a Visuospatial Module to provide stable control while selectively leveraging vision for obstacle avoidance and terrain understanding, even under out-of-distribution visual disturbances. Empirical results show robust sim-to-real transfer, strong performance on diverse outdoor terrains, and graceful fallback to proprioception when vision is unreliable. This approach offers a practical, robust solution for real-world legged locomotion in visually challenging environments.
Abstract
Vision-based locomotion has shown great promise in enabling legged robots to perceive and adapt to complex environments. However, visual information is inherently fragile: it is vulnerable to occlusions, reflections, and lighting changes, which often cause instability in locomotion. Inspired by animal sensorimotor integration, we propose KiVi, a Kinesthetic-Visuospatial integration framework in which the kinesthetic pathway encodes proprioceptive sensing of body motion and the visuospatial pathway captures visual perception of the surrounding terrain. Specifically, KiVi separates these pathways, using proprioception as a stable backbone while selectively incorporating vision for terrain awareness and obstacle avoidance. This modality-balanced yet integrative design, combined with memory-enhanced attention, allows the robot to robustly interpret visual cues while retaining proprioception as a stable fallback. Extensive experiments show that our method enables quadruped robots to stably traverse diverse terrains and operate reliably in unstructured outdoor environments, remaining robust to out-of-distribution (OOD) visual noise and occlusion unseen during training, thereby highlighting its effectiveness and applicability to real-world legged locomotion.
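To make the idea of a proprioceptive backbone with selective visual fusion concrete, the following is a minimal illustrative sketch, not the KiVi architecture itself. It assumes a single-head residual cross-attention in which a kinesthetic feature queries a memory of visual tokens, plus a hypothetical zero-valued "null" slot that absorbs attention when visual tokens are masked as unreliable. The names `fuse`, `visual_memory`, and `valid_mask` are assumptions introduced here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(kin_feat, visual_memory, valid_mask):
    """Residual cross-attention from a kinesthetic query to visual tokens.

    Hypothetical sketch: a null slot (score 0, zero value) is always present,
    so if every visual token is masked out the context is exactly zero and
    the output equals the kinesthetic feature (proprioceptive fallback).
    """
    dim = len(kin_feat)
    scale = 1.0 / math.sqrt(dim)

    # The null slot lets attention "opt out" of vision.
    scores = [0.0]
    values = [[0.0] * dim]

    # Only tokens flagged as valid take part in attention.
    for token, valid in zip(visual_memory, valid_mask):
        if valid:
            scores.append(dot(kin_feat, token) * scale)
            values.append(token)

    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

    # Residual connection: proprioception is the backbone, vision is additive.
    return [k + c for k, c in zip(kin_feat, context)]


kin = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0]]

with_vision = fuse(kin, memory, [True, True])
without_vision = fuse(kin, memory, [False, False])
```

In this sketch, masking all visual tokens leaves the kinesthetic feature unchanged, while valid tokens shift it by an attention-weighted visual context. How KiVi actually detects unreliable vision and structures its memory is not specified in this abstract, so the explicit mask here is only a stand-in.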
