RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action

Xiaoquan Sun, Ruijian Zhang, Kang Pang, Bingchen Miao, Yuxiang Tan, Zhen Yang, Ming Li, Jiayu Chen

TL;DR

RoboTidy addresses a gap in embodied AI benchmarks for language-guided tidying by unifying VLA and VLN evaluation within photorealistic 3DGS scenes and a sim-to-real workflow. It introduces an "Action (Object, Container)" abstraction, four manipulation primitives, and modular pipelines (Tidying, Manipulation, Navigation, Sensors) built in NVIDIA Isaac Sim, with Qwen2.5-VL guiding perception and planning. The dataset comprises 500 3DGS scenes, 6.4k manipulation trajectories, 1.5k navigation trajectories, and end-to-end real-world tidying demonstrations to support training and evaluation. Across object sorting, manipulation, and navigation tasks, RoboTidy supports generalization studies and shows benefits from sim-to-real transfer, highlighting the value of diverse, physically grounded data for language-guided robotic tidying.

Abstract

Household tidying is an important application area, yet current benchmarks neither model user preferences nor support mobility, and they generalize poorly, making it hard to comprehensively assess integrated language-to-action capabilities. To address this, we propose RoboTidy, a unified benchmark for language-guided household tidying that supports Vision-Language-Action (VLA) and Vision-Language-Navigation (VLN) training and evaluation. RoboTidy provides 500 photorealistic 3D Gaussian Splatting (3DGS) household scenes (covering 500 objects and containers) with collisions, formulates tidying as an "Action (Object, Container)" list, and supplies 6.4k high-quality manipulation demonstration trajectories and 1.5k navigation trajectories to support both few-shot and large-scale training. We also deploy RoboTidy in the real world for object tidying, establishing an end-to-end benchmark for household tidying. RoboTidy offers a scalable platform and bridges a key gap in embodied AI by enabling holistic and realistic evaluation of language-guided robots.

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of the RoboTidy benchmark framework and dataset. It spans Navigation, Object Sorting, and Manipulation: Qwen2.5-VL parses observations into an "Action (Object, Container)" list that is executed as manipulation actions. Our dataset offers 500 3DGS household scenes, 500 objects and containers, 6.4k manipulation trajectories, and 1.5k navigation trajectories for sim2real evaluation.
  • Figure 2: Object Sorting Pipeline. From workspace observations, Qwen2.5-VL identifies objects and containers and produces an "Action (Object, Container)" list, which the system executes as manipulation actions to complete sorting.
  • Figure 3: Visualization of real-world tasks. E1: four manipulation action tasks (Pick and Place, Pick and Toss, Open the Container, and Close the Container). E2: a household tidying task in which the policy follows an "Action (Object, Container)" list to sort objects. Tasks progress from left to right.
  • Figure 4: Real-world experimental setup. (a) Workspace with objects and containers. (b) Cobot-Magic dual-arm mobile manipulation platform.
  • Figure 5: Real-world experimental results (E1). We report success rates for four manipulation action tasks under three different settings.
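
The "Action (Object, Container)" list mentioned in the captions for Figures 1 to 3 can be pictured as a small data structure. The following is a minimal sketch in Python, not the authors' implementation. The names `Primitive`, `TidyAction`, and `parse_action_list` are hypothetical, as is the one-entry-per-line text format `<action> (<object>, <container>)` and the assumption that every entry carries both an object and a container. Only the four action names are taken from the Figure 3 caption; the object and container names in the usage note are made up.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List


class Primitive(Enum):
    # The four manipulation action tasks named in the Figure 3 caption.
    PICK_AND_PLACE = "Pick and Place"
    PICK_AND_TOSS = "Pick and Toss"
    OPEN_CONTAINER = "Open the Container"
    CLOSE_CONTAINER = "Close the Container"


@dataclass(frozen=True)
class TidyAction:
    # One entry of the action list (hypothetical representation).
    primitive: Primitive
    obj: str
    container: str


# Assumed line format: "<action> (<object>, <container>)".
_LINE = re.compile(
    r"^\s*(?P<name>[^()]+?)\s*"
    r"\(\s*(?P<obj>[^,()]+?)\s*,\s*(?P<container>[^,()]+?)\s*\)\s*$"
)


def parse_action_list(text: str) -> List[TidyAction]:
    """Parse planner output, one action per line, into TidyAction entries."""
    by_name = {p.value.lower(): p for p in Primitive}
    actions = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"malformed action line: {line!r}")
        name = match.group("name").lower()
        if name not in by_name:
            raise ValueError(f"unknown action: {match.group('name')!r}")
        actions.append(
            TidyAction(by_name[name], match.group("obj"), match.group("container"))
        )
    return actions
```

Under these assumptions, calling `parse_action_list("Pick and Place (apple, fruit bowl)")` yields a single `TidyAction` whose primitive is `Primitive.PICK_AND_PLACE`. Rejecting unknown action names with a `ValueError` is a design choice of this sketch: it keeps free-form model output from reaching an executor unchecked.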