Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning

Xiaowen Sun, Xufeng Zhao, Jae Hee Lee, Wenhao Lu, Matthias Kerzel, Stefan Wermter

TL;DR

This paper introduces the Object State-Sensitive Agent (OSSA), a task-planning agent powered by pre-trained neural networks, in two variants: a modular one (a dense captioning model plus an LLM) and a monolithic one (a VLM only). Both variants can handle object state-sensitive tasks, but the monolithic approach outperforms the modular one.

Abstract

The state of an object reflects its current status or condition and is important for a robot's task planning and manipulation. However, detecting an object's state and generating a state-sensitive plan for robots are challenging. Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown impressive capabilities in generating plans. Yet, to the best of our knowledge, there has been hardly any investigation into whether LLMs or VLMs can also generate object state-sensitive plans. To study this, we introduce an Object State-Sensitive Agent (OSSA), a task-planning agent empowered by pre-trained neural networks. We propose two methods for OSSA: (i) a modular model consisting of a pre-trained vision processing module (dense captioning model, DCM) and a natural language processing model (LLM), and (ii) a monolithic model consisting only of a VLM. To quantitatively evaluate the performance of the two methods, we use tabletop scenarios where the task is to clear the table. We contribute a multimodal benchmark dataset that takes object states into consideration. Our results show that both methods can be used for object state-sensitive tasks, but the monolithic approach outperforms the modular approach. The code for OSSA is available at https://github.com/Xiao-wen-Sun/OSSA

Paper Structure (17 sections, 4 figures, 2 tables, 1 algorithm)

This paper contains 17 sections, 4 figures, 2 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The given scene contains various objects in various states, for example: orange, half orange, and orange peel; clean napkin and dirty napkin; banana and banana peel. Based on commonsense knowledge, the agent sorts the objects (e.g., discarding the banana peel in the trash bin and keeping the bananas in the cupboard). However, the robot cannot decide how to handle leftover food (e.g., the half orange and half bread), because different people may have different preferences regarding leftovers.
  • Figure 2: Overview of our two proposed methods for OSSA: (a) OSSA-LLM-DCM is the modular model, which combines a prompted large language model (LLM) with a dense captioning model (DCM); (b) OSSA-VLM is the monolithic model, which consists only of a vision-language model (VLM).
  • Figure 3: Chain-of-thought for OSSA. (a) The pre-trained model (e.g., LLM or VLM) utilizes commonsense knowledge to reason about the object state; (b) according to the object's state and user's preference, the model generates a destination for the object; (c) according to the object's state, shape, and size, the model generates a grasping action for the object; (d) according to the object's state and destination, the model generates a placing action for the object.
  • Figure 4: Dataset Statistics
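
The four chain-of-thought steps described in Figure 3 can be read as a sequence of chained prompts, where each step reuses the answers of earlier steps. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `plan_object`, `ObjectPlan`, the prompt wording, and the canned `stub_model` are all hypothetical. In a real system the stub would presumably be replaced by an LLM that receives DCM captions as the scene description (modular variant) or by a VLM that receives the image itself (monolithic variant).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any callable that maps a text prompt to a text answer.
TextModel = Callable[[str], str]


@dataclass
class ObjectPlan:
    name: str
    state: str
    destination: str
    grasp: str
    place: str


def plan_object(model: TextModel, name: str, scene: str,
                preference: str = "none") -> ObjectPlan:
    """Chain the four reasoning steps (a)-(d) for a single object."""
    # (a) Reason about the object's state from the scene description.
    state = model(f"Scene: {scene}\nObject: {name}\nWhat is its state?")
    # (b) Choose a destination from the state and the user's preference.
    destination = model(
        f"Object: {name}\nState: {state}\n"
        f"User preference: {preference}\nWhere should it go?"
    )
    # (c) Choose a grasping action from the state, shape, and size.
    grasp = model(
        f"Object: {name}\nState: {state}\n"
        "Consider its shape and size.\nHow should it be grasped?"
    )
    # (d) Choose a placing action from the state and the destination.
    place = model(
        f"Object: {name}\nState: {state}\n"
        f"Destination: {destination}\nHow should it be placed?"
    )
    return ObjectPlan(name, state, destination, grasp, place)


# Canned answers, keyed by the question that ends each prompt (assumption:
# a toy stand-in so the example runs without any model).
CANNED = {
    "What is its state?": "inedible peel",
    "Where should it go?": "trash bin",
    "How should it be grasped?": "pinch grasp",
    "How should it be placed?": "drop from above",
}


def stub_model(prompt: str) -> str:
    """Return the canned answer for the question at the end of the prompt."""
    for question, answer in CANNED.items():
        if prompt.endswith(question):
            return answer
    return "unknown"


plan = plan_object(stub_model, "banana peel",
                   scene="a banana peel next to two bananas on a table")
print(plan)
```

Keeping the model behind a plain prompt-to-text callable means the same four-step chain can be driven by either variant; only the way the scene reaches the model changes.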