WildLMa: Long Horizon Loco-Manipulation in the Wild

Ri-Zhao Qiu, Yuchen Song, Xuanbin Peng, Sai Aneesh Suryadevara, Ge Yang, Minghuan Liu, Mazeyu Ji, Chengzhe Jia, Ruihan Yang, Xueyan Zou, Xiaolong Wang

TL;DR

WildLMa introduces a modular framework for in-the-wild loco-manipulation using a quadruped with a manipulator. It combines a VR-adapted whole-body controller for data-efficient teleoperation, a generalizable WildLMa-Skill library learned via imitation with CLIP-based language conditioning, and WildLMa-Planner to compose skills with an LLM for long-horizon tasks. The approach is validated through extensive real-world demonstrations, ablations, and comparisons showing improved generalization, robustness to unseen objects, and effective long-horizon execution. The results highlight practical potential for deploying legged robots in diverse environments, enabling complex tasks beyond simple pick-and-place.

Abstract

'In-the-wild' mobile manipulation aims to deploy robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place. Quadruped robots with manipulators hold promise for extending the workspace and enabling robust locomotion, but existing results do not investigate such a capability. This paper proposes WildLMa with three components to address these issues: (1) adaptation of a learned low-level controller for VR-enabled whole-body teleoperation and traversability; (2) WildLMa-Skill -- a library of generalizable visuomotor skills acquired via imitation learning or heuristics; and (3) WildLMa-Planner -- an interface to the learned skills that allows LLM planners to coordinate skills for long-horizon tasks. We demonstrate the importance of high-quality training data by achieving a higher grasping success rate than existing RL baselines using only tens of demonstrations. WildLMa exploits CLIP for language-conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Overview of WildLMa models and robot setups. (a) WildLMa takes a frozen CLIP model to encode task-specific texts and visual observations; (b) our robot platform is a Unitree B1 quadruped combined with a Unitree Z1 arm and a 3D-printed gripper, with two RGBD cameras and one lidar mounted on it.
  • Figure 2: Overview of WildLMa-Planner. Given a constructed hierarchical scene graph, WildLMa-Planner adopts a coarse-to-fine search mechanism to determine which nodes to traverse and which structured actions to take.
  • Figure 3: Qualitative illustrations of some evaluated tasks.
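
The coarse-to-fine search described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `score` function, and the example scene are all hypothetical, and a simple shared-word count stands in for the LLM's relevance judgment. It only shows the general idea of choosing a coarse region first and then refining to an object within it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene-graph node: a region (coarse) or an object in it (fine)."""
    name: str
    description: str
    children: list["Node"] = field(default_factory=list)


def score(query: str, text: str) -> int:
    """Stand-in for an LLM relevance judgment: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def coarse_to_fine(root: Node, query: str) -> list[str]:
    """Descend from the root, keeping the best-scoring child at each level."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: score(query, c.description))
        path.append(node.name)
    return path


# Illustrative scene (assumed, not taken from the paper).
scene = Node("building", "whole building", [
    Node("hallway", "hallway with trash bin and bottle", [
        Node("bottle", "empty plastic bottle"),
        Node("bin", "trash bin"),
    ]),
    Node("office", "office with bookshelf and books", [
        Node("bookshelf", "bookshelf with books"),
        Node("desk", "desk with monitor"),
    ]),
])

print(coarse_to_fine(scene, "rearrange books on the bookshelf"))
```

In a real system the greedy `max` step would presumably be replaced by a language-model query over the candidate nodes, and each leaf would map to a skill invocation rather than a name.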