Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai

TL;DR

This work defines ORS3D, a task in which embodied agents must combine operations research (OR)-based scheduling with precise 3D grounding to complete composite tasks efficiently. It introduces ORS3D-60K, a large-scale dataset of 60,825 composite tasks across 4,376 real-world scenes that require parallelizable subtasks and grounded actions, and GRANT, a multi-modal LLM equipped with a Scheduling Token Mechanism that interfaces with an optimizer to produce efficient schedules and grounded step-by-step actions. The scheduling module delivers a 30.53% gain in time efficiency (TE), and GRANT outperforms scene-level grounding baselines in grounding accuracy. By tightly integrating language understanding, 3D grounding, and OR-informed scheduling, this work advances practical embodied planning and provides a foundation for end-to-end, OR-aware multi-modal embodied AI in realistic environments.

Abstract

Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
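
The sink-and-microwave example in the abstract can be made concrete with a small sketch. It is only an illustration of why parallelizable subtasks shorten total completion time. It is not GRANT's optimizer or the ORS3D-60K data format. The `Subtask` class, the function names, and all durations are hypothetical, and the model assumes that starting a parallelizable subtask costs the agent no time and that it then runs unattended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    name: str
    minutes: float
    parallelizable: bool  # assumption: True means it runs unattended once started


def sequential_time(subtasks):
    """Total time if the agent finishes each subtask before starting the next."""
    return sum(s.minutes for s in subtasks)


def parallel_schedule(subtasks):
    """Start every parallelizable subtask at t=0, then do the rest one by one.

    Returns (makespan, timeline), where timeline holds (start, end, name) tuples.
    Assumes starting a parallelizable subtask takes the agent zero time.
    """
    background = [s for s in subtasks if s.parallelizable]
    foreground = [s for s in subtasks if not s.parallelizable]

    # Background subtasks all run from t=0 without occupying the agent.
    timeline = [(0.0, float(s.minutes), s.name) for s in background]

    # Foreground subtasks occupy the agent back to back.
    clock = 0.0
    for s in foreground:
        timeline.append((clock, clock + s.minutes, s.name))
        clock += s.minutes

    makespan = max((end for _, end, _ in timeline), default=0.0)
    return makespan, timeline


# Hypothetical durations, chosen only for illustration.
tasks = [
    Subtask("heat food in the microwave", 5, True),
    Subtask("clean the sink", 3, False),
    Subtask("wipe the table", 4, False),
]

makespan, timeline = parallel_schedule(tasks)
print("sequential:", sequential_time(tasks))
print("scheduled: ", makespan)
for start, end, name in sorted(timeline):
    print(f"  {start:>4.1f} - {end:>4.1f}  {name}")
```

With these made-up durations, running everything in sequence takes 12 minutes, while starting the microwave first and cleaning during the wait takes 7. A realistic scheduler would also need to account for start-up effort, travel between objects, and ordering constraints, none of which this sketch models.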

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 10 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Comparison of different task completion schemes. An embodied agent is expected to use operations research knowledge to efficiently complete tasks through scheduling.
  • Figure 2: Illustration of the proposed Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D). When assigned a composite task by a human, the embodied agent needs to complete the subtasks efficiently by carefully scheduling using operations research knowledge and simultaneously locating the target objects in each step for navigation and manipulation.
  • Figure 3: Non-parallelizable subtask & parallelizable subtask.
  • Figure 4: (a) The ORS3D-60K dataset generation pipeline, which first generates subtask meta-information from 3D scene graphs, then uses this information to generate the structured dataset. (b) A composite task example from ORS3D-60K dataset. The green color mask indicates the ground-truth target object in the corresponding step.
  • Figure 5: Distributions of (a) subtask number in each composite task, and (b) the expected time of each subtask.
  • ...and 2 more figures