A Modular AIoT Framework for Low-Latency Real-Time Robotic Teleoperation in Smart Cities

Shih-Chieh Sun, Yun-Cheng Tsai

TL;DR

This work addresses the demand for real-time remote robotic manipulation in smart-city contexts by introducing a modular AIoT framework that unifies edge AI inference, lightweight IoT messaging, and cross-platform mobile interaction. The approach hinges on a dual-protocol architecture (MQTT for control, WebRTC via LiveKit for video) paired with a Flutter UI and a 6-DOF robotic platform, supported by a cloud-based YOLOv11-nano perception stack and edge/cloud deployment on DigitalOcean. Key contributions include the integrated, scalable architecture; empirical latency results (actuator latency as low as 0.2 s locally and under 0.7 s across VPNs; end-to-end video latency under 1.2 s with AI overlays); and a demonstrated 94.6% remote grasping success rate across diverse object classes. The framework offers a practical blueprint for smart-city teleoperation, enabling remote education, cross-border collaboration, and distributed industrial tasks with cost-effective, resilient operation and clear extensibility to new hardware and AI models.

Abstract

This paper presents an AI-driven IoT robotic teleoperation system designed for real-time remote manipulation and intelligent visual monitoring, tailored for smart city applications. The architecture integrates a Flutter-based cross-platform mobile interface with MQTT-based control signaling and WebRTC video streaming via the LiveKit framework. A YOLOv11-nano model is deployed for lightweight object detection, enabling real-time perception with annotated visual overlays delivered to the user interface. Control commands are transmitted via MQTT to an ESP8266-based actuator node, which coordinates multi-axis robotic arm motion through an Arduino Mega2560 controller. The backend infrastructure is hosted on DigitalOcean, ensuring scalable cloud orchestration and stable global communication. Latency evaluations conducted under both local and international VPN scenarios (including Hong Kong, Japan, and Belgium) demonstrate actuator response times as low as 0.2 seconds and total video latency under 1.2 seconds, even across high-latency networks. This low-latency dual-protocol design ensures responsive closed-loop interaction and robust performance in distributed environments. Unlike conventional teleoperation platforms, the proposed system emphasizes modular deployment, real-time AI sensing, and adaptable communication strategies, making it well-suited for smart city scenarios such as remote infrastructure inspection, public equipment servicing, and urban automation. Future enhancements will focus on edge-device deployment, adaptive routing, and integration with city-scale IoT networks to enhance resilience and scalability.
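
The control path described in the abstract (MQTT commands sent to the actuator node, with actuator response time as the reported metric) can be illustrated with a minimal sketch. This is not the authors' implementation: the topic name, the JSON payload layout, the joint numbering, and the function names below are all assumptions made for illustration, since the abstract does not specify a message format. The sketch shows how a timestamped joint command could be encoded and how a one-way latency could be derived from it on the receiving side.

```python
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical topic name; the paper does not state its MQTT topic layout.
CONTROL_TOPIC = "robot/arm/control"

NUM_JOINTS = 6  # 6-DOF arm; joint indices 0..5 are an assumed numbering


@dataclass
class JointCommand:
    joint: int        # index of the joint to move (assumed 0-based)
    angle_deg: float  # target angle in degrees (assumed unit)
    sent_at: float    # sender timestamp, seconds since the epoch


def encode_command(joint, angle_deg, now=None):
    """Serialize a joint command to UTF-8 JSON bytes for an MQTT payload."""
    if not 0 <= joint < NUM_JOINTS:
        raise ValueError(f"joint must be in 0..{NUM_JOINTS - 1}, got {joint}")
    sent_at = time.time() if now is None else now
    return json.dumps(asdict(JointCommand(joint, angle_deg, sent_at))).encode("utf-8")


def actuator_latency(payload, received_at):
    """Return seconds elapsed between sending and receiving a command.

    Only meaningful if sender and receiver clocks are synchronized;
    otherwise a round-trip acknowledgement would have to be timed instead.
    """
    command = json.loads(payload.decode("utf-8"))
    return received_at - command["sent_at"]
```

With an MQTT client such as paho-mqtt, the returned bytes could be handed to `client.publish(CONTROL_TOPIC, payload, qos=1)`; the choice of QoS level is likewise an assumption here. Keeping the encoding separate from the transport makes the latency calculation easy to check without a running broker.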

Paper Structure

This paper contains 18 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: System architecture showing data flow between the Flutter-based mobile interface, MQTT/WebRTC backend, ESP8266–Arduino actuation platform, and cloud-based YOLO inference server on DigitalOcean.
  • Figure 2: End-to-end prototype integrating the remote app (Fig. 5) and local hardware.
  • Figure 3: LiveKit session logs showing stable multi-user operation.
  • Figure 4: Six-degree-of-freedom robotic arm structure with labeled joints.
  • Figure 5: Real-time interface and robot response: YOLOv11 bounding boxes guide remote user selection and actuation.
  • ...and 1 more figure