CoNav: A Benchmark for Human-Centered Collaborative Navigation

Changhao Li, Xinyu Sun, Peihao Chen, Jugang Fan, Zixu Wang, Yanxia Liu, Jinhui Zhu, Chuang Gan, Mingkui Tan

TL;DR

CoNav tackles indoor human-centered collaborative navigation by introducing a benchmark that requires an agent to observe human actions, infer long- and short-term intent, and navigate to the human's intended destination before the human completes the action. The authors build a dataset using LLMs to generate environment-aligned activities, MotionGPT and PriorMDM to animate realistic humanoid motion, and Habitat 3.0 to integrate the motions into 3D scenes, yielding over 25k trajectories across 49 environments. An intention-aware agent is proposed, combining a Perceiver-based long-term intention predictor with a GST-based short-term trajectory predictor, feeding an LSTM policy that uses panoramic RGB-D observations to decide actions. Experimental results show the CoNav agent outperforms baselines, with ablations highlighting the value of predicting both the intended object and activity as well as short-term trajectories, demonstrating meaningful progress toward practical human-robot collaborative navigation in indoor settings.

Abstract

Human-robot collaboration, in which the robot intelligently assists the human with an upcoming task, is an appealing objective. To achieve this goal, the agent needs a fundamental collaborative navigation ability: it should reason about human intention by observing human activities and then navigate to the human's intended destination ahead of the human. However, this vital ability has not been well studied in the literature. To fill this gap, we propose a collaborative navigation (CoNav) benchmark. CoNav tackles the critical challenge of constructing a 3D navigation environment with realistic and diverse human activities. To achieve this, we design a novel LLM-based humanoid animation generation framework conditioned on both text descriptions and environmental context. The generated humanoid trajectories obey the environmental context and can be easily integrated into popular simulators. We empirically find that existing navigation methods struggle on the CoNav task because they neglect the perception of human intention. To solve this problem, we propose an intention-aware agent that reasons about both long-term and short-term human intention. The agent predicts navigation actions based on the predicted intention and panoramic observations. The emergent agent behaviors, including observing humans, avoiding collisions with humans, and navigating, reveal the efficiency of the proposed datasets and agents.
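
The task condition described above (reaching the human's intended destination ahead of the human) can be made concrete with a small check. The following is only an illustrative sketch: the 2D positions, the step-indexed paths, the 1.0 m success radius, and the function names are all assumptions made for this example, not the benchmark's actual success metric.

```python
from math import hypot
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) floor-plane position in meters


def first_arrival_step(path: Sequence[Point], goal: Point, radius: float) -> Optional[int]:
    """Return the first step index at which `path` is within `radius` of `goal`, or None."""
    for step, (x, y) in enumerate(path):
        if hypot(x - goal[0], y - goal[1]) <= radius:
            return step
    return None


def arrived_before_human(
    agent_path: Sequence[Point],
    human_path: Sequence[Point],
    goal: Point,
    radius: float = 1.0,  # assumed success radius, not taken from the paper
) -> bool:
    """True if the agent reaches the goal region strictly earlier than the human."""
    agent_step = first_arrival_step(agent_path, goal, radius)
    human_step = first_arrival_step(human_path, goal, radius)
    if agent_step is None:
        return False  # the agent never reached the destination
    # If the human never arrives in the recorded window, any agent arrival counts.
    return human_step is None or agent_step < human_step


# Toy episode: both paths are sampled at the same (assumed) time steps.
goal = (4.0, 0.0)
human = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
fast_agent = [(4.0, 3.0), (4.0, 2.0), (4.0, 1.0), (4.0, 0.5)]
slow_agent = [(4.0, 5.0), (4.0, 4.0), (4.0, 3.0), (4.0, 2.0), (4.0, 1.0)]

print(arrived_before_human(fast_agent, human, goal))  # agent arrives at step 2, human at step 3
print(arrived_before_human(slow_agent, human, goal))  # agent arrives at step 4, after the human
```

Comparing first-arrival steps keeps the check independent of path length, so a longer but faster agent route still counts as long as it ends inside the goal region first.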

Paper Structure (20 sections, 1 equation, 11 figures, 4 tables)

This paper contains 20 sections, 1 equation, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Illustration of an example episode in our CoNav task. The agent observes the human activity and then predicts and navigates to the human's intended destination.
  • Figure 2: Overview of CoNav dataset generation. We first use LLMs to reason environment-aligned causal activities. Then, generative models are used to animate activity and walking motion.
  • Figure 3: Visualization of generated (a) humanoid activity and (b) trajectory.
  • Figure 4: Overall architecture of our intention-aware agent. The policy takes as input the predicted intention, the panoramic RGB-D, and the agent pose for determining navigation actions.
  • Figure 5: Visualization of a testing episode.
  • ...and 6 more figures