AdaptManip: Learning Adaptive Whole-Body Object Lifting and Delivery with Online Recurrent State Estimation

Morgan Byrd, Donghoon Baek, Kartik Garg, Hyunyoung Jung, Daesol Cho, Maks Sorokin, Robert Wright, Sehoon Ha

TL;DR

AdaptManip addresses the challenge of fully autonomous humanoid loco-manipulation, enabling navigation, object lifting, and delivery without human demonstrations or motion capture. It integrates a base locomotion policy, a recurrent online object state estimator, and a residual manipulation policy, all trained in simulation with domain randomization and deployed to hardware in a zero-shot manner. The key contributions include online multimodal object pose estimation that remains reliable under occlusion, perception-aware control that couples state estimation with manipulation, and strong sim-to-real transfer demonstrated on a real humanoid during autonomous navigation, lifting, and delivery. The results show improved robustness and success over baselines, with the state estimator playing a crucial role in maintaining manipulation performance when vision is unreliable.

Abstract

This paper presents Adaptive Whole-body Loco-Manipulation, AdaptManip, a fully autonomous framework for humanoid robots to perform integrated navigation, object lifting, and delivery. Unlike prior imitation learning-based approaches that rely on human demonstrations and are often brittle to disturbances, AdaptManip aims to train a robust loco-manipulation policy via reinforcement learning without human demonstrations or teleoperation data. The proposed framework consists of three coupled components: (1) a recurrent object state estimator that tracks the manipulated object in real time under limited field-of-view and occlusions; (2) a whole-body base policy for robust locomotion with residual manipulation control for stable object lifting and delivery; and (3) a LiDAR-based robot global position estimator that provides drift-robust localization. All components are trained in simulation using reinforcement learning and deployed on real hardware in a zero-shot manner. Experimental results show that AdaptManip significantly outperforms baseline methods, including imitation learning-based approaches, in adaptability and overall success rate, while accurate object state estimation improves manipulation performance even under occlusion. We further demonstrate fully autonomous real-world navigation, object lifting, and delivery on a humanoid robot.
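To make component (1) concrete, the following Python sketch illustrates only the interface idea of a recurrent estimator: a state carried between time steps keeps producing an object estimate while the camera cannot see the object. It is not the paper's learned network. The class `ToyRecurrentEstimator`, the blending gain, and the use of a hand displacement as the proprioceptive cue are all assumptions made for illustration.

```python
from typing import List, Optional

Vector = List[float]


class ToyRecurrentEstimator:
    """Hand-written stand-in for a learned recurrent object state estimator.

    Hypothetical: the name, the gain, and the update rule are illustrative only.
    """

    def __init__(self, initial_position: Vector, vision_gain: float = 0.5) -> None:
        self.state = list(initial_position)  # recurrent state: last position estimate
        self.vision_gain = vision_gain

    def step(self, vision: Optional[Vector], hand_delta: Vector, in_contact: bool) -> Vector:
        # Predict: if the robot is holding the object, assume it moves with the hands.
        if in_contact:
            self.state = [s + d for s, d in zip(self.state, hand_delta)]
        # Correct: blend in the visual measurement only when one is available.
        if vision is not None:
            g = self.vision_gain
            self.state = [(1.0 - g) * s + g * v for s, v in zip(self.state, vision)]
        return list(self.state)


est = ToyRecurrentEstimator([0.0, 0.0, 0.0])
# Frame 1: object visible, robot not yet touching it.
seen = est.step(vision=[1.0, 0.0, 0.5], hand_delta=[0.0, 0.0, 0.0], in_contact=False)
# Frame 2: object occluded during the lift; the estimate follows the hands.
occluded = est.step(vision=None, hand_delta=[0.0, 0.0, 0.25], in_contact=True)
```

In the second call no image measurement exists, yet the estimate still moves because the carried state is updated from the proprioceptive cue. A learned recurrent model would replace both hand-written rules with trained weights.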

Paper Structure (20 sections, 6 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Fully autonomous humanoid loco-manipulation using online recurrent state estimation. (1) navigating toward the object, (2) lifting the object through coordinated whole-body motion, and (3) delivering the object to the target location. Our method relies solely on onboard sensing and does not require teleoperation data or an external mocap system.
  • Figure 2: Three-stage AdaptManip experiment plan and deployment. Stage 1: LiDAR odometry and proprioception enable autonomous navigation. Stage 2: Recurrent multimodal object-pose estimation supports coordinated lifting. Stage 3: Image-based refinement and residual policies ensure stable delivery. All stages operate using only onboard sensing.
  • Figure 3: Overview of the training and deployment pipeline. (1) A base whole-body control policy $\pi_{\mathrm{wbc}}$ is trained in IsaacLab to generate base whole-body behavior such as walking. (2) A manipulation residual policy $\pi_{\mathrm{res}}$ is trained on top of the base policy, taking proprioception and the estimated object state $\hat{X}_{\mathrm{box}}$ to produce residual actions $\Delta a_t$. The residual action aims to adaptively lift a 3D object. (3) A recurrent online object state estimator fuses vision and proprioceptive cues using a V-LSTM and MLP to infer $\hat{X}_{\mathrm{box}}$, and is trained jointly with the residual manipulation policy. During real-world deployment, the robot uses onboard estimators and LiDAR odometry and executes the combined policies $\pi_{\mathrm{wbc}}$ and $\pi_{\mathrm{res}}$ to complete the whole-body loco-manipulation task.
  • Figure 4: State estimation error of our method. The plot shows the mean $\pm$ 1 standard deviation across 50 episodes. The green region marks where vision is available, and the purple region marks where the robot is in contact with the box.
  • Figure 5: Hardware demonstration of the three-stage whole-body loco-manipulation task.
  • ...and 1 more figures
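
The data flow described in the Figure 3 caption can be sketched as a single control step in Python. The additive composition $a_t = a_t^{\mathrm{base}} + \Delta a_t$ is the usual residual-policy convention and is assumed here; the caption does not state how the two outputs are combined. All function names and the toy stand-in policies are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def compose_action(base_action: Vector, residual: Vector) -> Vector:
    """Add the residual to the base action (assumed additive composition)."""
    if len(base_action) != len(residual):
        raise ValueError("base action and residual must have the same length")
    return [b + r for b, r in zip(base_action, residual)]


def control_step(
    proprio: Vector,
    box_estimate: Vector,
    base_policy: Callable[[Vector], Vector],
    residual_policy: Callable[[Vector, Vector], Vector],
) -> Vector:
    """One step: pi_wbc gives the base action, pi_res corrects it using X_box_hat."""
    base_action = base_policy(proprio)
    residual = residual_policy(proprio, box_estimate)
    return compose_action(base_action, residual)


# Toy stand-ins for the trained networks (hypothetical, two joints only).
def toy_base_policy(proprio: Vector) -> Vector:
    return [0.5 * q for q in proprio]


def toy_residual_policy(proprio: Vector, box_estimate: Vector) -> Vector:
    # Shift every joint target in proportion to the estimated box height.
    return [0.25 * box_estimate[2] for _ in proprio]


action = control_step([1.0, -2.0], [0.0, 0.0, 1.0], toy_base_policy, toy_residual_policy)
```

Keeping the base policy and the residual as separate callables mirrors the two-stage training order in the caption: the base policy can be trained first, and the residual is then learned on top of it.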