ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts

Bilel Benjdira; Anis Koubaa; Anas M. Ali

TL;DR

This work formalizes the Prompting Robotic Modalities (PRM) design pattern and realizes it in ROSGPT_Vision, a framework that commands robots using only prompts: each modality has its own prompt, and a central Task Modality mediates between them. ROSGPT_Vision integrates vision models such as LLaVA, MiniGPT-4, and SAM with ROS2 to translate visual input into natural language, then uses an LLM to determine the robotic action; both stages are configurable via YAML prompts. The CarMate application is a concrete proof of concept: it monitors the driver and gives real-time feedback, and it was built at a substantially lower development cost thanks to the PRM architecture and the two-prompt workflow. The work contributes an open-source architecture, modular prompting strategies, and a path toward multi-modal robotic systems that rely on prompt engineering rather than handcrafted pipelines, which could accelerate research and deployment in real-world settings.

Abstract

In this paper, we argue that the next generation of robots can be commanded using only Language Models' prompts. Each prompt separately interrogates a specific Robotic Modality via its Modality Language Model (MLM). A central Task Modality mediates the whole communication to execute the robotic mission via a Large Language Model (LLM). This paper names this new robotic design pattern Prompting Robotic Modalities (PRM). Moreover, this paper applies the PRM design pattern in building a new robotic framework named ROSGPT_Vision. ROSGPT_Vision allows the execution of a robotic task using only two prompts: a Visual Prompt and an LLM Prompt. The Visual Prompt extracts, in natural language, the visual semantic features related to the task under consideration (Visual Robotic Modality). Meanwhile, the LLM Prompt regulates the robotic reaction to the visual description (Task Modality). The framework automates all the mechanisms behind these two prompts. It enables the robot to address complex real-world scenarios by processing visual data, making informed decisions, and carrying out actions automatically. The framework comprises one generic vision module and two independent ROS nodes. As a test application, we used ROSGPT_Vision to develop CarMate, which monitors the driver's distraction on the road and gives real-time vocal notifications to the driver. We showed how ROSGPT_Vision significantly reduced the development cost compared to traditional methods. We demonstrated how to improve the quality of the application by optimizing the prompting strategies, without delving into technical details. ROSGPT_Vision is shared with the community (link: https://github.com/bilel-bj/ROSGPT_Vision) to advance robotic research in this direction and to build more robotic frameworks that implement the PRM design pattern and enable controlling robots using only prompts.
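The two-prompt flow described in the abstract can be illustrated with a minimal sketch. Everything named here (PromptPair, run_prm_step, the stand-in models, and the prompt wording) is hypothetical and chosen for illustration only; none of it is taken from the ROSGPT_Vision repository. The framework itself is built around a vision module and ROS nodes, whereas this sketch uses plain Python callables so that it runs without ROS, model weights, or API keys.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical container for the two prompts; not the framework's own type.
@dataclass
class PromptPair:
    visual_prompt: str  # asks the vision model what to describe
    llm_prompt: str     # tells the LLM how to react to that description


def run_prm_step(
    image: bytes,
    prompts: PromptPair,
    describe_image: Callable[[bytes, str], str],
    decide_action: Callable[[str], str],
) -> str:
    """One pass: image -> natural-language description -> robotic reaction."""
    # Visual Robotic Modality: turn the image into text, guided by the Visual Prompt.
    description = describe_image(image, prompts.visual_prompt)
    # Task Modality: the LLM Prompt plus the description determine the reaction.
    llm_input = f"{prompts.llm_prompt}\n\nVisual description: {description}"
    return decide_action(llm_input)


# Stand-in models (assumptions) so the sketch is runnable on its own.
def fake_vision_model(image: bytes, prompt: str) -> str:
    return "The driver is looking at a phone." if image else "No image received."


def fake_llm(text: str) -> str:
    return "Please keep your eyes on the road." if "phone" in text else "No action needed."


# Illustrative prompts in the spirit of CarMate; the wording is invented here.
carmate_prompts = PromptPair(
    visual_prompt="Describe the driver's gaze and what the driver's hands are doing.",
    llm_prompt="If the driver seems distracted, reply with a short spoken warning.",
)

notification = run_prm_step(b"\x00", carmate_prompts, fake_vision_model, fake_llm)
print(notification)
```

In this arrangement, changing the application's behavior means editing only the two prompt strings, which is the property the PRM pattern aims for; swapping the stand-in callables for real model clients would leave run_prm_step unchanged.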
