
MorFiC: Fixing Value Miscalibration for Zero-Shot Quadruped Transfer

Prakhar Mishra, Amir Hossain Raj, Xuesu Xiao, Dinesh Manocha

Abstract

Generalizing learned locomotion policies across quadrupedal robots with different morphologies remains a challenge. Policies trained on a single robot often break when deployed on embodiments with different mass distributions, kinematics, joint limits, or actuation constraints, forcing per-robot retraining. We present MorFiC, a reinforcement learning approach for zero-shot cross-morphology locomotion using a single shared policy. MorFiC resolves a key failure mode in multi-morphology actor-critic training: a shared critic tends to average incompatible value targets across embodiments, yielding miscalibrated advantages. To address this, MorFiC conditions the critic via morphology-aware modulation driven by robot physical and control parameters, generating morphology-specific value estimates within a shared network. Trained on a single source robot with morphology randomization in simulation, MorFiC transfers to unseen robots and surpasses morphology-conditioned PPO baselines in stable average speed and longest stable run on multiple targets, including speed gains of +16.1% on A1, ~2x on Cheetah, and ~5x on B1. We additionally show that MorFiC reduces the variance of the value-prediction error across morphologies and stabilizes the advantage estimates, demonstrating that improved value-function calibration corresponds to stronger transfer performance. Finally, we demonstrate zero-shot deployment on two Unitree robots, Go1 and Go2, without fine-tuning, indicating that critic-side conditioning is a practical approach for cross-morphology generalization.
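To make the critic-side conditioning concrete, the following is a minimal forward-pass sketch of a morphology-modulated critic. The figure captions name the modulation as FiLM and the latent as $z = f(m)$; everything else here is an assumption for illustration: the class name `FiLMCritic`, the layer sizes, the `tanh` activations, the `(1 + d_gamma)` parameterization of the scale, and the use of NumPy with random, untrained weights. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_linear(n_in, n_out):
    # Small random weights and zero bias (illustrative initialisation only).
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


class FiLMCritic:
    """Hypothetical critic V(s, m): state features are scaled and shifted by z = f(m)."""

    def __init__(self, obs_dim, morph_dim, latent_dim=8, hidden_dim=32):
        self.enc = init_linear(morph_dim, latent_dim)         # f: m -> z
        self.film = init_linear(latent_dim, 2 * hidden_dim)   # z -> (d_gamma, beta)
        self.body = init_linear(obs_dim, hidden_dim)          # state features
        self.head = init_linear(hidden_dim, 1)                # scalar value

    def value(self, obs, morph):
        # Morphology latent z = f(m).
        z = np.tanh(morph @ self.enc[0] + self.enc[1])
        # One linear map produces both FiLM parameters; split them in half.
        d_gamma, beta = np.split(z @ self.film[0] + self.film[1], 2, axis=-1)
        # FiLM: feature-wise scale and shift of the state features.
        h = obs @ self.body[0] + self.body[1]
        h = np.tanh((1.0 + d_gamma) * h + beta)
        return (h @ self.head[0] + self.head[1]).squeeze(-1)


# Same observations evaluated under two different (made-up) morphology descriptors.
critic = FiLMCritic(obs_dim=10, morph_dim=6)
obs = rng.normal(size=(4, 10))
v_a = critic.value(obs, np.ones((4, 6)))
v_b = critic.value(obs, -np.ones((4, 6)))
print(v_a.shape, np.abs(v_a - v_b).max())
```

Because the morphology descriptor enters through the scale and shift, the same state can receive different value estimates for different robots while all other weights stay shared, which is the property the abstract attributes to the conditioned critic.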

Paper Structure (42 sections, 15 equations, 5 figures, 7 tables)


Figures (5)

  • Figure 1: We present MorFiC, a reinforcement learning framework for cross-morphology locomotion that trains a single policy on one source quadruped and transfers it, without retraining, to other robots with different morphologies (e.g., different mass distributions, joint limits, and actuation constraints). It does so by using a morphology-aware FiLM-modulated critic to improve how the value function generalizes under morphology shift. Real-world deployment on a Unitree Go1: MorFiC runs stably at $\approx 1.4$--$1.7$ m/s without task-specific fine-tuning. Top two rows: representative Go1 roll-outs (indoor and outdoor). Third row: push-disturbance recovery on Go1. Bottom row: zero-shot deployment on Unitree Go2 (runs at $\approx 1.1$--$1.3$ m/s) using the same trained policy weights.
  • Figure 2: Overview of MorFiC. First, we sample a morphology descriptor $m \sim p(m)$ and apply it to the simulator (masses/inertia, joint limits/angles, torque limits, and PD/action scaling) to instantiate morphology-specific dynamics and generate roll-outs. We then train a policy with PPO using a morphology latent $z = f(m)$; the critic is FiLM-modulated by $z$ to predict a morphology-conditioned value and improve advantage estimates. Deployment (right) runs the trained JIT modules (morph encoder and body network, with optional adaptation) with an LCM-based state estimator and RC/command profile input.
  • Figure 3: Critic calibration and advantage stability under morphology shift. (a) Explained variance (EV) between bootstrapped returns and critic predictions across target robots; MorFiC maintains higher EV on out-of-distribution morphologies than PPO and morphology-to-policy baselines. (b) Advantage-noise proxy computed from the same rollouts; MorFiC reduces advantage noise on OOD robots, supporting the thesis that morphology-conditioned value calibration yields more stable updates and stronger zero-shot transfer.
  • Figure 4: Representative reward and value trajectories during deployment on the Go1 robot. MorFiC (second row) produces smooth value estimates (orange) that track the expected discounted returns from noisy instantaneous rewards (green). Whereas the non-MorFiC value estimates (first row) are noisy, MorFiC's morphology-conditioned critic maintains stable value predictions even on unseen morphologies.
  • Figure 5: Performance declines as the distance from the training robot (Go2) increases, but MorFiC still significantly outperforms the other PPO variants and generalizes better to OOD robots than the other methods compared.
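The explained-variance (EV) metric in Figure 3(a) can be illustrated with a short sketch. It assumes the common definition $\mathrm{EV} = 1 - \operatorname{Var}(R - V) / \operatorname{Var}(R)$; the paper's exact formula may differ. The two "robots" and their returns below are synthetic toy values chosen only to show how a critic that averages value targets across embodiments becomes miscalibrated on each of them. They are not results from the paper.

```python
import numpy as np


def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns); NaN if returns are constant."""
    returns = np.asarray(returns, dtype=float)
    values = np.asarray(values, dtype=float)
    var_ret = returns.var()
    if var_ret == 0.0:
        return float("nan")
    return 1.0 - (returns - values).var() / var_ret


# Synthetic returns for two hypothetical embodiments visiting the same states.
returns_a = np.array([1.0, 2.0, 3.0, 4.0])      # low-return robot
returns_b = np.array([10.0, 20.0, 30.0, 40.0])  # high-return robot

# A shared critic that cannot tell the robots apart fits the average target.
shared_values = (returns_a + returns_b) / 2.0

ev_perfect = explained_variance(returns_a, returns_a)
ev_shared_a = explained_variance(returns_a, shared_values)
ev_shared_b = explained_variance(returns_b, shared_values)
print(ev_perfect, ev_shared_a, ev_shared_b)
```

In this toy case a perfectly calibrated critic scores an EV of 1, while the averaged critic scores about 0.80 on the high-return robot and a strongly negative value on the low-return robot, where its predictions are worse than a constant baseline.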