"Pass the butter": A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT

Haohua Que, Wenbin Pan, Jie Xu, Hao Luo, Pei Wang, Li Zhang

TL;DR

This work aims to make desktop-level service robots practical by integrating multimodal inputs (speech, language, and vision) on edge devices. It proposes a ROS2-based architecture that combines ASR via Whisper, NLP via BERT, and CV via YOLO, together with a control hub and selective offloading of heavy computation to a remote host. The approach is validated on three tasks (opening doors, switching lights, and passing a cup), where it achieves high speech recognition and action execution rates, and through comparisons across platforms and YOLO versions, which demonstrate feasibility and robustness on lightweight edge hardware. The study highlights practical implications for real-time, multimodal desktop robots and outlines future directions in miniaturization, onboard dialogue capabilities, and further optimization for edge devices.

Abstract

In recent years, various intelligent autonomous robots have begun to appear in daily life and production. Desktop-level robots are characterized by flexible deployment, rapid response, and suitability for light-workload environments. To meet the current societal demand for service robot technology, this study proposes using a miniaturized, ROS-based desktop-level robot as the platform, deploying a natural language model (NLP-BERT) locally, and integrating visual recognition (CV-YOLO) and speech recognition (ASR-Whisper) as inputs, so that the desktop robot can make decisions and act autonomously. Three comprehensive experiments were designed to validate the robotic arm, and the approach performed well in all three. In Task 1, the execution rates for speech recognition and action performance were 92.6% and 84.3%, respectively. In Task 2, the highest execution rates under the given conditions reached 92.1% and 84.6%, while in Task 3, the highest execution rates were 95.2% and 80.8%, respectively. The authors therefore conclude that integrating ASR, NLP, and related technologies on edge devices is feasible and provides a technical and engineering foundation for realizing multimodal desktop-level robots.

Paper Structure (32 sections, 2 equations, 14 figures, 9 tables)

Figures (14)

  • Figure 1: The flow of robot execution motion (Animation and Reality).
  • Figure 2: The structure of the paper.
  • Figure 3: Robotic arm structure diagram and device tree.
  • Figure 4: BERT model fine-tuning for intent classification.
  • Figure 5: CMLP classification network.
  • ...and 9 more figures