Beyond Syntax: Action Semantics Learning for App Agents

Bohan Tang; Dezhao Luo; Jianheng Liu; Jingxuan Chen; Shaogang Gong; Jianye Hao; Jun Wang; Kun Shao

TL;DR

The paper addresses the out-of-distribution (OOD) vulnerability of syntax-based App agent fine-tuning by reframing actions through their semantics. It introduces Action Semantics Learning (ASL), which defines the semantics of an action as the UI state transition that the action causes, and optimises a semantics-aware objective via a lightweight Semantic Estimator (SEE). SEE combines a world-prediction model and a semantic similarity calculator to provide step-level semantic rewards, enabling both supervised and reinforcement fine-tuning without extra deployment costs. The authors show, theoretically and empirically, that ASL improves robustness and generalisation across online and offline benchmarks (AndroidWorld, AndroidLab, AndroidControl, AitW, WebArena-Lite) and enhances RL fine-tuning; ablations confirm that both the ASL loss and the SEE rewards are necessary. Overall, ASL offers a principled and efficient way to train open-source App agents that handle UI variations while reducing reliance on proprietary LLM APIs.

Abstract

The recent development of Large Language Models (LLMs) has enabled the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions built on proprietary LLM APIs show promising ability, they incur heavy compute costs and depend on external APIs. Fine-tuning smaller open-source LLMs addresses these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce the ground-truth action strings exactly, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework whose learning objective is to capture the semantics of the ground-truth actions. Specifically, inspired by programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity that trains App agents to generate actions aligned with the semantics of the ground-truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate that ASL is more robust to the OOD problem than the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.
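To make the distinction between syntax and semantics concrete, the following toy sketch compares two click actions by string equality and by the UI state transition they induce. It is only an illustration under strong assumptions: `Button`, `SCREENS`, `toy_transition`, `syntactic_match`, and `semantic_match` are hypothetical names, and the hand-written transition table stands in for the learned world-prediction model and similarity calculator that SEE uses in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    """A clickable rectangle that leads to another screen (hypothetical toy UI)."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int
    target_screen: str

    def contains(self, x: int, y: int) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


# Assumed toy UI: one screen with two buttons.
SCREENS = {
    "home": [
        Button("settings", 0, 0, 100, 50, "settings"),
        Button("mail", 0, 60, 100, 110, "mail"),
    ],
}


def toy_transition(screen: str, action: tuple) -> str:
    """Hand-written stand-in for a world model: returns the next screen."""
    kind, x, y = action
    if kind != "click":
        return screen  # other action types are ignored in this sketch
    for button in SCREENS.get(screen, []):
        if button.contains(x, y):
            return button.target_screen
    return screen  # clicking empty space leaves the screen unchanged


def syntactic_match(predicted: tuple, ground_truth: tuple) -> bool:
    """Syntax view: the action must be reproduced exactly."""
    return predicted == ground_truth


def semantic_match(screen: str, predicted: tuple, ground_truth: tuple) -> bool:
    """Semantics view: the two actions must induce the same state transition."""
    return toy_transition(screen, predicted) == toy_transition(screen, ground_truth)


ground_truth = ("click", 50, 25)   # centre of the "settings" button
same_button = ("click", 60, 30)    # different coordinates, same button
other_button = ("click", 50, 80)   # lands on the "mail" button

print(syntactic_match(same_button, ground_truth))
print(semantic_match("home", same_button, ground_truth))
print(semantic_match("home", other_button, ground_truth))
```

In this sketch the second click fails the exact-match check yet induces the same transition as the ground truth, which is the kind of action a purely syntactic objective would penalise. A real estimator would return a graded similarity rather than a boolean.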

