LHManip: A Dataset for Long-Horizon Language-Grounded Manipulation Tasks in Cluttered Tabletop Environments
Federico Ceola, Lorenzo Natale, Niko Sünderhauf, Krishan Rana
TL;DR
The paper addresses the lack of real-world, long-horizon, language-grounded manipulation data for home-like robotics by introducing LHManip, a dataset of 200 teleoperated episodes covering 20 tasks and 33 objects, with 10 demonstrations and one natural language instruction per task. Data were collected on a 7-DoF Franka Panda arm with multi-view RGB-D perception, proprioception, and end-to-end control at 30 Hz, yielding 176,278 observation-action pairs that are integrated into the Open X-Embodiment project. The contribution includes detailed task descriptions, observation-action specifications, public access to the dataset, and preprocessing tools that convert the data to the RLDS format. The work aims to spur development and benchmarking of long-horizon, language-guided manipulation methods that generalize to cluttered real-world settings and diverse object configurations, as a step toward practical autonomous home robotics.
Abstract
Instructing a robot to complete an everyday task within our homes has been a long-standing challenge for robotics. Recent progress in language-conditioned imitation learning and offline reinforcement learning has demonstrated impressive performance across a wide range of tasks, but these methods are typically limited to short-horizon tasks that do not reflect what a home robot would be expected to complete. While existing architectures have the potential to learn the desired behaviours, the lack of long-horizon, multi-step datasets for real robotic systems poses a significant challenge. To this end, we present the Long-Horizon Manipulation (LHManip) dataset, comprising 200 episodes that demonstrate 20 different manipulation tasks via real robot teleoperation. The tasks entail multiple sub-tasks, including grasping, pushing, stacking and throwing objects in highly cluttered environments. Each task is paired with a natural language instruction and multi-camera viewpoints for point-cloud or NeRF reconstruction. In total, the dataset comprises 176,278 observation-action pairs, which form part of the Open X-Embodiment dataset. The full LHManip dataset is publicly available at https://github.com/fedeceola/LHManip.
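To give a rough sense of what "long-horizon" means for these figures, the short sketch below derives the episode count and the average episode length from the numbers quoted above. It assumes that one observation-action pair is logged per step at the 30 Hz rate mentioned in the TL;DR; that assumption, and the function and variable names, are illustrative and not part of the dataset tooling. The result is a mean over all episodes, so individual episode lengths will differ.

```python
# Figures quoted in the text above.
NUM_TASKS = 20
DEMOS_PER_TASK = 10
TOTAL_PAIRS = 176_278

# Assumption: one observation-action pair is recorded per 30 Hz step.
RATE_HZ = 30


def dataset_summary(num_tasks, demos_per_task, total_pairs, rate_hz):
    """Derive simple aggregate statistics from the headline dataset figures."""
    episodes = num_tasks * demos_per_task
    mean_steps = total_pairs / episodes        # average pairs per episode
    mean_seconds = mean_steps / rate_hz        # average episode duration
    total_hours = total_pairs / rate_hz / 3600  # total recorded time
    return {
        "episodes": episodes,
        "mean_steps_per_episode": mean_steps,
        "mean_seconds_per_episode": mean_seconds,
        "total_hours": total_hours,
    }


summary = dataset_summary(NUM_TASKS, DEMOS_PER_TASK, TOTAL_PAIRS, RATE_HZ)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under the stated assumption, the average demonstration runs for roughly 880 steps, or a little under half a minute of teleoperation.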
