LHManip: A Dataset for Long-Horizon Language-Grounded Manipulation Tasks in Cluttered Tabletop Environments

Federico Ceola; Lorenzo Natale; Niko Sünderhauf; Krishan Rana

TL;DR

The paper tackles the lack of real-world, long-horizon, language-grounded manipulation data for home-like robotics by introducing LHManip, a dataset of 200 teleoperated episodes covering 20 tasks and 33 objects, with 10 demonstrations and one natural language instruction per task. Data were collected on a 7-DoF Franka Panda arm with multi-view RGB-D perception, proprioception, and end-to-end control at 30 Hz, yielding 176,278 observation-action pairs that are integrated into the Open X-Embodiment project. The contribution includes detailed task descriptions, observation-action specifications, and public access to the dataset, plus accompanying preprocessing tools that convert the data into the RLDS format. The work aims to spur development and benchmarking of long-horizon, language-guided manipulation methods that generalize to cluttered real-world settings and diverse object configurations.
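To give a feel for what "long-horizon" means here, the headline figures can be turned into rough per-episode statistics. The sketch below is only back-of-the-envelope arithmetic: it assumes the 30 Hz control rate is also the rate at which observation-action pairs were logged, which the summary does not state explicitly, and the function and key names are illustrative.

```python
# Figures quoted in the summary above.
NUM_TASKS = 20
DEMOS_PER_TASK = 10
NUM_PAIRS = 176_278

# Assumption: pairs were recorded at the 30 Hz control rate.
RATE_HZ = 30


def dataset_stats(num_tasks, demos_per_task, num_pairs, rate_hz):
    """Derive rough episode-length statistics from the headline counts."""
    episodes = num_tasks * demos_per_task
    mean_steps = num_pairs / episodes          # average pairs per episode
    mean_seconds = mean_steps / rate_hz        # average episode duration
    total_minutes = num_pairs / rate_hz / 60   # total recorded time
    return {
        "episodes": episodes,
        "mean_steps": mean_steps,
        "mean_seconds": mean_seconds,
        "total_minutes": total_minutes,
    }


stats = dataset_stats(NUM_TASKS, DEMOS_PER_TASK, NUM_PAIRS, RATE_HZ)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, an average episode runs for several hundred control steps, that is, roughly half a minute of continuous teleoperation, and the whole dataset amounts to well over an hour of recorded interaction.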

Abstract

Instructing a robot to complete an everyday task within our homes has been a long-standing challenge for robotics. While recent progress in language-conditioned imitation learning and offline reinforcement learning has demonstrated impressive performance across a wide range of tasks, these methods are typically limited to short-horizon tasks, which do not reflect those a home robot would be expected to complete. While existing architectures have the potential to learn these desired behaviours, the lack of the necessary long-horizon, multi-step datasets for real robotic systems poses a significant challenge. To this end, we present the Long-Horizon Manipulation (LHManip) dataset comprising 200 episodes, demonstrating 20 different manipulation tasks via real robot teleoperation. The tasks entail multiple sub-tasks, including grasping, pushing, stacking and throwing objects in highly cluttered environments. Each task is paired with a natural language instruction and multi-camera viewpoints for point-cloud or NeRF reconstruction. In total, the dataset comprises 176,278 observation-action pairs which form part of the Open X-Embodiment dataset. The full LHManip dataset is made publicly available at https://github.com/fedeceola/LHManip.

Paper Structure (9 sections, 4 figures, 3 tables)

This paper contains 9 sections, 4 figures, 3 tables.

Introduction
Related Work
LHManip
Experimental Set-Up and Data Collection
Dataset
Tasks
Observation and Action Space
Dataset Access
Conclusion

Figures (4)

Figure 1: Robot and environment setup used for data collection.
Figure 2: (a) Motion capture and robot setup. (b) The robot was teleoperated by a human operator equipped with a motion capture system that detects hand gestures and movements in 3D space.
Figure 3: Sub-task decomposition of a "Place the bowl on the plate and the cup in the bowl matching the color" sequence.
Figure 4: Task variations: we consider different plate-bowl colors for the "Place the bowls on the appropriate plates" task (left) and different plates for the "Dry the plate" task (right).
