RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains

Shady Nasrat, Myungsu Kim, Seonil Lee, Jiho Lee, Yeoncheol Jang, Seung-joon Yi

TL;DR

RDMM addresses the challenge of enabling domain-specific, self-aware decision making for robots using fine-tuned, quantized LLMs that run entirely on-device. The approach combines 4-bit quantization with QLoRA adapters to produce RDMM models capable of self-knowledge grounded planning, accessible via a modular framework including a parser/controller, a Vision-Language Model, YOLO detectors, and on-device ASR. A RoboCup@Home–focused dataset (27k planning instances; 1.3k annotated images) and an open-source framework underpin the work, achieving high planning accuracy (up to 92.98% for RDMM-8B) and enabling operation on hardware with as little as 8 GB RAM. Real-world RoboCup@Home experiments demonstrate autonomous performance and natural human-robot interaction, underscoring the practicality of edge-enabled LLMs for domestic robotics.
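As a rough illustration of why 4-bit quantization matters for the 8 GB target, the sketch below estimates the memory taken by model weights alone. The parameter counts are assumed from the nominal model names (RDMM-8B, RDMM-7B, RDMM-0.5B), and the estimate deliberately ignores quantization constants, adapter weights, activations, the KV cache, and the other models in the framework, so it is a lower bound and not a measurement from the paper.

```python
def quantized_weight_gib(num_params: float, bits_per_weight: int = 4) -> float:
    """Approximate memory for the weights alone, in GiB (lower bound)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal parameter counts inferred from the model names (an assumption).
MODEL_SIZES = {"RDMM-8B": 8e9, "RDMM-7B": 7e9, "RDMM-0.5B": 0.5e9}

if __name__ == "__main__":
    for name, num_params in MODEL_SIZES.items():
        fp16 = quantized_weight_gib(num_params, bits_per_weight=16)
        int4 = quantized_weight_gib(num_params, bits_per_weight=4)
        print(f"{name}: 16-bit ~{fp16:.1f} GiB, 4-bit ~{int4:.1f} GiB")
```

Under these assumptions, an 8B-parameter model needs roughly 15 GiB for 16-bit weights but under 4 GiB at 4 bits, which is what leaves headroom for the remaining components on an 8 GB device.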

Abstract

Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. We showcase the capabilities of our framework in the context of the real-world RoboCup@Home household competition. This research introduces a framework built on RDMM (Robotics Decision-Making Models), which can make decisions within domain-specific contexts and are aware of their own knowledge and capabilities. The framework leverages this information to enhance the system's autonomous decision-making. In contrast to other approaches, our focus is on real-time, on-device solutions that operate successfully on hardware with as little as 8 GB of memory. The framework incorporates visual perception models that equip robots with an understanding of their environment, and it integrates real-time speech recognition to enhance human-robot interaction. Experimental results demonstrate that the RDMM framework can plan with 93% accuracy. Furthermore, we introduce a new dataset consisting of 27k planning instances, as well as 1.3k text-image annotated samples derived from the competition. The framework, benchmarks, datasets, and models developed in this work are publicly available on our GitHub repository at https://github.com/shadynasrat/RDMM.

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: RDMM Overview: The process begins by fine-tuning quantized LLM models on our specialized dataset to create RDMM models. The illustration shows an example of RDMM's on-device inference, followed by the proposed framework parsing the RDMM-generated plans for execution. These plans are carried out by a controller that interacts with various models, enabling both robotic manipulation and locomotion.
  • Figure 2: Dataset Distribution by Task: An overview of the dataset allocation, illustrating the ratio of data dedicated to each specific task to ensure balanced and comprehensive training for task-specific model performance.
  • Figure 3: Household Robot Planning with RDMM: This illustration shows three examples of Lucio, a home service robot, using local RDMM model inference to plan and execute tasks. These include planning actions to make cereal, answering self-awareness questions about Lucio’s personal memory, and combining actions with self-awareness by retrieving an apple for a person and engaging in conversation about itself.
  • Figure 4: Benchmark Accuracy Across Tasks: This graph presents the evaluation results for RDMM-8B, RDMM-7B, and RDMM-0.5B models, compared with 20-shot conditioned baseline models Llama3-8B, Mistral-7B, and Qwen2-0.5B, alongside GPT-4o and GPT-4o-mini. It highlights their accuracy across various tasks, offering insights into each model's performance in different task scenarios.
  • Figure 5: Framework VRAM consumption: A graphical representation depicting the VRAM usage of each model within the framework.
  • ...and 1 more figure
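
The parse-then-dispatch flow described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the plan syntax (one function-style call per line), the action names `Move_To`, `Pick_Up`, and `Respond`, and the `Controller` class are all hypothetical and chosen only to show the idea of turning generated text into handler calls.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical plan syntax: one call per line, e.g. Move_To("kitchen")
CALL_RE = re.compile(r'^\s*(\w+)\((.*)\)\s*$')

Step = Tuple[str, List[str]]


def parse_plan(plan_text: str) -> List[Step]:
    """Turn model-generated text into (action_name, string_args) steps."""
    steps: List[Step] = []
    for line in plan_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = CALL_RE.match(line)
        if match is None:
            raise ValueError(f"Unparseable plan line: {line!r}")
        name, raw_args = match.groups()
        args = re.findall(r'"([^"]*)"', raw_args)  # double-quoted arguments only
        steps.append((name, args))
    return steps


class Controller:
    """Hypothetical dispatcher mapping action names to handler functions."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, handler: Callable[..., str]) -> None:
        self.handlers[name] = handler

    def execute(self, steps: List[Step]) -> List[str]:
        log: List[str] = []
        for name, args in steps:
            if name not in self.handlers:
                raise KeyError(f"No handler registered for action {name!r}")
            log.append(self.handlers[name](*args))
        return log


def build_demo_controller() -> Controller:
    # Stand-in handlers; a real system would call navigation, manipulation,
    # and speech components here.
    controller = Controller()
    controller.register("Move_To", lambda place: f"navigating to {place}")
    controller.register("Pick_Up", lambda item: f"grasping {item}")
    controller.register("Respond", lambda text: f"saying: {text}")
    return controller


if __name__ == "__main__":
    plan = 'Move_To("kitchen")\nPick_Up("apple")\nRespond("Here is your apple.")'
    for entry in build_demo_controller().execute(parse_plan(plan)):
        print(entry)
```

Keeping the parser separate from the controller means an unknown or malformed action fails before anything is sent to the robot, which is one plausible reason for splitting the two roles.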