PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM

Kelin Fu; Yang Tian; Kaigui Bian

TL;DR

PeriGuru is a peripheral robotic mobile app operation assistant based on GUI image understanding and prompting with a Large Language Model (LLM). It achieves a success rate of 81.94% on the test task set, more than double that of the same method without PeriGuru's GUI image interpretation and prompting design.

Abstract

Smartphones have significantly enhanced our daily learning, communication, and entertainment, becoming an essential component of modern life. However, certain populations, including the elderly and individuals with disabilities, encounter challenges in using smartphones and therefore need mobile app operation assistants, also known as mobile app agents. Taking privacy, permissions, and cross-platform compatibility into account, we devise and develop PeriGuru, a peripheral robotic mobile app operation assistant based on GUI image understanding and prompting with a Large Language Model (LLM). PeriGuru uses a suite of computer vision techniques to analyze GUI screenshot images and employs an LLM to inform action decisions, which are then executed by robotic arms. PeriGuru achieves a success rate of 81.94% on the test task set, more than double that of the method without PeriGuru's GUI image interpretation and prompting design. Our code is available at https://github.com/Z2sJ4t/PeriGuru.

Paper Structure (11 sections, 6 figures, 3 tables)

Introduction
Limitations of Generalized Large Language Models
Method
Perception: GUI Layout Understanding
Decision-Making: LLM Agent Establishment
Action: Action Space Design
Experiments
Methodology
Results
Related Work
Conclusion and Future Work

Figures (6)

Figure 1: A showcase of coordinate misunderstanding by AppAgent (grid). On the left is a portion of the screenshot with added grids, and on the right is a part of the output provided by gpt-4-vision-preview. The task is to follow the team "Sparks", which means the agent should tap on the region that encompasses the team's logo, as indicated by the red dashed box on the left. However, the agent's actual selection, marked by the red cross, deviates from the intended target area, so the agent fails to accomplish the task.
Figure 2: PeriGuru's system overview.
Figure 3: PeriGuru's GUI image process architecture.
Figure 4: Augmentation of color and shape makes the training data collected through software screenshots closer to the training data captured by the camera, and thus enhances the robustness of the object detection model.
Figure 5: An example of PeriGuru's robotic arm testbed. It recognizes GUI elements such as the search icon and keyboard, and directs the robotic arm to execute a sequence of actions that align with the decisions rendered by the LLM.
...and 1 more figure
