
Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

Weiqin Zu, Wenbin Song, Ruiqing Chen, Ze Guo, Fanglei Sun, Zheng Tian, Wei Pan, Jun Wang

TL;DR

This work proposes LIM2N, an LLM-driven interactive multimodal multitask robot navigation framework that lets users direct a robot through language and hand-drawn sketches. It coordinates three parts of a complex system: reasoning over multimodal input, multitask planning, and the adaptation and processing performed by the intelligent sensing modules.

Abstract

Socially-aware navigation systems have evolved to avoid various obstacles while performing multiple tasks, such as point-to-point navigation, human-following, and human-guiding. However, a prominent gap persists: in Human-Robot Interaction (HRI), communicating commands to robots demands intricate mathematical formulations. Furthermore, transitions between tasks lack the intuitive control and user-centric interactivity that one would desire. In this work, we propose an LLM-driven interactive multimodal multitask robot navigation framework, termed LIM2N, to address this gap. We achieve this by first introducing a multimodal interaction framework in which language and hand-drawn inputs can serve as navigation constraints and control objectives. Next, a reinforcement learning agent is built to handle multiple tasks with the received information. Crucially, LIM2N creates smooth cooperation among the reasoning of multimodal input, multitask planning, and the adaptation and processing of the intelligent sensing modules in this complex system. Extensive experiments in both simulation and the real world demonstrate that LIM2N understands user needs better and offers an enhanced interactive experience.

Paper Structure (13 sections, 4 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Users ask a robot to guide VIP05 to the fridge, taking a route near the bookshelf (blue line). Challenges include an undetectable carpet area (red box) and a box requiring distance (gray box). Users provide guidance through language and sketches input on an interactive interface.
  • Figure 2: Overview of LIM2N. The framework contains an LLM module, an Intelligent Sensing Module, and a Reinforcement Learning Module.
  • Figure 3: The LLM backbone processes an 'instruction' input and determines a 'Guiding' service need. Using a semantic map, it identifies the coordinates of the fridge and the bookshelf. The function library sets the fridge as the end goal and sketches a path via the bookshelf. The outputs are sent to the RL module and the Intelligent Sensing Module, respectively.
  • Figure 4: Task Mode processing determines the target location $g^t$ at time $t$ from the task type and the destination information (end goal and pedestrian positions). This $g^t$ and the merged laser map are provided to the SAC component as observations.
  • Figure 5: Left: Semantic map based on our simulator and real-world environment; Right: A robot in the real-world environment.
  • ...and 4 more figures
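
The Task Mode step described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `TaskMode` and `select_target`, the `max_gap` parameter, and the per-task rules (follow the pedestrian, head to the end goal, wait when the guided person falls behind) are all assumptions made for illustration. The caption states only that the target location depends on the task type, the end goal, and the pedestrian positions.

```python
import math
from enum import Enum
from typing import Optional, Tuple

Point = Tuple[float, float]


class TaskMode(Enum):
    """Hypothetical labels for the three tasks named in the abstract."""
    POINT_TO_POINT = "point_to_point"
    FOLLOWING = "following"
    GUIDING = "guiding"


def select_target(
    mode: TaskMode,
    end_goal: Point,
    pedestrian: Optional[Point],
    robot: Point,
    max_gap: float = 2.0,  # assumed threshold in metres, not from the paper
) -> Point:
    """Return an illustrative target location for the current time step."""
    if mode is TaskMode.POINT_TO_POINT:
        # Assumed rule: the target is simply the end goal.
        return end_goal

    if pedestrian is None:
        raise ValueError(f"{mode.value} requires a pedestrian position")

    if mode is TaskMode.FOLLOWING:
        # Assumed rule: the robot tracks the pedestrian's current position.
        return pedestrian

    # GUIDING, assumed rule: head to the end goal, but hold position
    # when the guided person is farther away than max_gap.
    gap = math.hypot(pedestrian[0] - robot[0], pedestrian[1] - robot[1])
    return robot if gap > max_gap else end_goal
```

In a setup like the one the caption describes, the returned target would then be combined with the merged laser map to form the observation passed to the SAC component. How the two are encoded is not specified in this summary.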