VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion

Aditya Shirwatkar, Satyam Gupta, Shishir Kolathaya

Abstract

Perceptive locomotion for legged robots requires anticipating and adapting to complex, dynamic environments. Model Predictive Control (MPC) serves as a strong baseline, providing interpretable motion planning with constraint enforcement, but struggles with high-dimensional perceptual inputs and rapidly changing terrain. In contrast, model-free Reinforcement Learning (RL) adapts well across visually challenging scenarios but lacks planning. To bridge this gap, we propose VIP-Loco, a framework that integrates vision-based scene understanding with RL and planning. During training, an internal model maps proprioceptive states and depth images into compact kinodynamic features used by the RL policy. At deployment, the learned models are used within an infinite-horizon MPC formulation, combining adaptability with structured planning. We validate VIP-Loco in simulation on challenging locomotion tasks, including slopes, stairs, crawling, tilting, gap jumping, and climbing, across three robot morphologies: a quadruped (Unitree Go1), a biped (Cassie), and a wheeled biped (TronA1-W). Through ablations and comparisons with state-of-the-art methods, we show that VIP-Loco unifies planning and perception, enabling robust, interpretable locomotion in diverse environments.

Paper Structure (21 sections, 4 equations, 5 figures, 2 tables, 2 algorithms)

Figures (5)

  • Figure 1: Conceptual outline of VIP-Loco -- The framework learns a compact internal model from vision and proprioception during training, which is then utilized by an infinite-horizon MPC planner at deployment to enable anticipatory, constraint-aware locomotion.
  • Figure 2: Overview of VIP-Loco Framework -- The proposed framework consists of two major components: (1) Learning Stage (left) – The internal model includes a GRU cell $g_\varphi$ (operating at 10 Hz, updating recurrent memory $h$ at each step) coupled with a CNN-MLP encoder for depth processing, an encoder/dynamics pair for latent state estimation, and reward/value heads. The Expert Actor (50 Hz) receives the imagined rollout $\mathcal{X}$ and hidden state $h$ via stop-gradient. (2) Deployment Stage (right) – a data-driven MPC that uses the learned internal model to iteratively sample and refine trajectories, leveraging vision-based scene understanding to select actions that maximize long-term reward while satisfying kinodynamic constraints.
  • Figure 3: Training comparison for Go1 (quadruped) over $5$ seeds -- Top: episodic return progression over training iterations. Bottom: average terrain level successfully mastered. VIP-Loco (Variational) achieves the highest and most stable returns, steadily mastering harder terrains (levels $\ge$ 6). WMP performs competitively but asymptotically regresses to easier terrains. VIP-Loco (Consistency) converges earlier and stagnates at lower levels.
  • Figure 4: Comparison of locomotion performance for Go1 (quadruped) across terrains of increasing difficulty and $5$ seeds -- The plots show the success rate (top row) and average return (bottom row) for five locomotion methods: HIM-Loco, PIP-Loco, WMP, VIP-Loco (with Consistency Loss), and VIP-Loco (with Variational Loss). Each column corresponds to a different terrain type and the x-axis indicates terrain difficulty level (0 = easiest, 8 = hardest).
  • Figure 5: Qualitative evaluation for VIP-Loco with planning across three robot morphologies: (a) Go1 (quadruped), (b) Cassie (biped), and (c) TronA1-W (wheeled biped). Each subfigure includes (left) predicted vs. actual CoM height trajectories for Climb and Crawl tasks, and (right) corresponding execution frames. The MPC predictions and actual measurements demonstrate interpretable dynamics modeling, while the snapshots show task-specific stable behaviors across diverse morphologies.
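
The Deployment Stage in the Figure 2 caption describes an MPC that uses the learned internal model to iteratively sample and refine trajectories toward high long-term reward. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a cross-entropy-method style refinement and a terminal value bootstrap as the way the infinite horizon is approximated; neither choice is confirmed by the captions. The function `plan_cem`, the toy stand-ins for the dynamics, reward, and value heads, and all hyperparameters are hypothetical, and kinodynamic constraints are reduced here to a simple action clip.

```python
import numpy as np


def plan_cem(z0, dynamics, reward, value, act_dim, horizon=5, samples=256,
             elites=32, iters=4, gamma=0.99, seed=0):
    """Sample-and-refine planner; returns the first action of the refined plan.

    dynamics, reward, and value are placeholders for learned model heads
    (hypothetical signatures, batched over the sample dimension).
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences and clip them to [-1, 1].
        noise = rng.standard_normal((samples, horizon, act_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        z = np.repeat(z0[None, :], samples, axis=0)
        returns = np.zeros(samples)
        # Roll each candidate through the model, accumulating discounted reward.
        for t in range(horizon):
            returns += gamma ** t * reward(z, actions[:, t])
            z = dynamics(z, actions[:, t])
        # Assumption: a value head bootstraps the return beyond the horizon.
        returns += gamma ** horizon * value(z)
        # Refit the sampling distribution to the best-scoring candidates.
        elite = actions[np.argsort(returns)[-elites:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0]


# Toy stand-ins for the learned heads (illustrative only, not from the paper).
def toy_dynamics(z, a):
    return z + 0.1 * a  # latent state drifts in the direction of the action


def toy_reward(z, a):
    return -np.sum(z ** 2, axis=1) - 0.01 * np.sum(a ** 2, axis=1)


def toy_value(z):
    return -10.0 * np.sum(z ** 2, axis=1)


z0 = np.array([1.0, -1.0])
action = plan_cem(z0, toy_dynamics, toy_reward, toy_value, act_dim=2)
print(action)
```

In this toy setup the reward is highest at the origin of the latent space, so the planner should return an action that pushes each coordinate of `z0` back toward zero. In a receding-horizon loop, only this first action would be executed before replanning from the next latent state.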