Task-oriented Robotic Manipulation with Vision Language Models

Nurhan Bulus Guran, Hanchi Ren, Jingjing Deng, Xianghua Xie

TL;DR

This work tackles the limitation that Vision Language Models struggle to capture complex spatial relationships essential for robotic manipulation. It proposes a framework that converts scenes into hierarchical tree representations of spatial relations and combines them with object attributes; a GPT-4o-based reorganization yields task-aligned configurations. A new synthetic dataset of ~600 images with manual spatial relation captions and object attributes (fragility, mass, material, transparency) supports training and evaluation. The pipeline integrates object detection (YOLOv8), attribute extraction via a fine-tuned VLM, and triplet-based spatial reasoning to inform manipulation planning. Results indicate improved spatial understanding and more effective, task-driven object organization, suggesting a scalable path toward more autonomous robotic systems.

Abstract

Vision Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. Accurately understanding spatial relationships remains a non-trivial challenge, yet it is essential for effective robotic manipulation. In this work, we introduce a novel framework that integrates VLMs with a structured spatial reasoning pipeline to perform object manipulation based on high-level, task-oriented input. Our approach transforms visual scenes into tree-structured representations that encode the spatial relations among objects. These trees are subsequently processed by a Large Language Model (LLM) to infer restructured configurations that determine how the objects should be organized for a given high-level task. To support our framework, we also present a new dataset containing manually annotated captions that describe spatial relations among objects, along with object-level attribute annotations such as fragility, mass, material, and transparency. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.
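
The tree-structured scene representation described above can be illustrated with a minimal sketch. It assumes spatial relations are given as (subject, relation, object) triplets and that each object has at most one supporting parent. The function names (`build_tree`, `render`, `build_prompt`), the example objects, the attribute values, and the prompt wording are all hypothetical and are not taken from the paper. Only the attribute names (fragility, mass, material, transparency) come from the dataset description.

```python
from collections import defaultdict

# Hypothetical scene: (subject, relation, object) triplets.
TRIPLETS = [
    ("mug", "on", "table"),
    ("book", "on", "table"),
    ("glass", "on", "book"),
]

# Hypothetical attribute values; only the attribute names follow the dataset.
ATTRIBUTES = {
    "mug": {"fragility": "fragile", "mass": "light",
            "material": "ceramic", "transparency": "opaque"},
    "glass": {"fragility": "fragile", "mass": "light",
              "material": "glass", "transparency": "transparent"},
}


def build_tree(triplets):
    """Return (roots, children); children maps parent -> [(relation, child)].

    Assumes every subject has at most one parent, so the result is a forest.
    """
    children = defaultdict(list)
    has_parent = set()
    nodes = []  # keeps first-seen order so the output is deterministic
    for subject, relation, obj in triplets:
        children[obj].append((relation, subject))
        has_parent.add(subject)
        for name in (obj, subject):
            if name not in nodes:
                nodes.append(name)
    roots = [name for name in nodes if name not in has_parent]
    return roots, dict(children)


def render(node, children, attributes, relation=None, depth=0):
    """Render one subtree as indented text lines with attribute annotations."""
    label = f"{relation} -> {node}" if relation else node
    attrs = attributes.get(node, {})
    if attrs:
        label += " [" + ", ".join(f"{k}={v}" for k, v in attrs.items()) + "]"
    lines = ["  " * depth + label]
    for child_relation, child in children.get(node, []):
        lines.extend(render(child, children, attributes, child_relation, depth + 1))
    return lines


def build_prompt(task, triplets, attributes):
    """Combine the task description with the attribute-annotated scene tree."""
    roots, children = build_tree(triplets)
    tree_lines = []
    for root in roots:
        tree_lines.extend(render(root, children, attributes))
    return f"Task: {task}\nScene tree:\n" + "\n".join(tree_lines)


if __name__ == "__main__":
    print(build_prompt("Clear the table for dinner.", TRIPLETS, ATTRIBUTES))
```

The resulting text is the kind of input that could be passed to an LLM, which would then return a restructured tree for the task. The serialization and prompt format actually used by the framework are not specified in this summary, so the layout here is only one plausible choice.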

Paper Structure

This paper contains 13 sections, 4 figures.

Figures (4)

  • Figure 1: Overview of our framework. Objects are first detected and their attributes are extracted using a fine-tuned VLM [gao2024physically]. Spatial relationships between objects are manually described and used to build tree structures representing these relationships. The tree structures are then combined with the object attributes, and this combination is fed, together with a task-oriented prompt, into a language model to generate a new representation for the given task.
  • Figure 2: Sample synthetic images from the dataset, showing randomly arranged objects for task-oriented robotic manipulation.
  • Figure 3: Task simulation examples. The first set of images shows the generated images, while the second set illustrates the simulation results.
  • Figure 4: An example of initial and transformed hierarchical tree structures.