Beyond Syntax: Action Semantics Learning for App Agents

Bohan Tang, Dezhao Luo, Jianheng Liu, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

TL;DR

The paper addresses the OOD vulnerability of syntax-based App-agent fine-tuning by reframing actions through their semantics. It introduces Action Semantics Learning (ASL), which defines action semantics as the UI state transition caused by an action and optimises a semantics-aware objective via a lightweight Semantic Estimator (SEE). SEE combines a world-prediction model and a semantic similarity calculator to provide step-level semantic rewards, enabling both supervised and reinforcement fine-tuning without extra deployment costs. The authors prove and empirically demonstrate that ASL improves robustness and generalisation across diverse online and offline benchmarks (AndroidWorld, AndroidLab, AndroidControl, AitW, WebArena-Lite) and enhances RL fine-tuning, with ablations confirming the necessity of both ASL loss and SEE rewards. Overall, ASL offers a principled, open, and efficient path to robust App agents capable of handling UI variations while reducing reliance on heavy API prompts.

Abstract

The recent development of Large Language Models (LLMs) has enabled the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions built on proprietary LLM APIs show promising ability, they incur heavy compute costs and depend on external APIs. Fine-tuning smaller open-source LLMs addresses these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce the ground truth action strings exactly, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework in which the learning objective is to capture the semantics of the ground truth actions. Specifically, inspired by programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity that trains App agents to generate actions aligned with the semantics of ground truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate that ASL is more robust to the OOD problem than the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.
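The contrast between syntax-level and semantics-level feedback can be illustrated with a toy example. The sketch below is not the authors' SEE implementation: the screen layout, the rule-based `toy_world_model`, the Jaccard-based `state_similarity`, and all names are hypothetical stand-ins chosen only to show the idea that two differently written actions can induce the same UI state transition.

```python
import re

# Hypothetical toy UI: clickable regions on the "home" screen and the
# screen each one leads to. Boxes are (x0, y0, x1, y1).
HOME_REGIONS = {
    "settings": (0, 0, 100, 50),
    "search": (0, 60, 100, 110),
}

# Hypothetical state description: the set of elements visible on each screen.
SCREEN_ELEMENTS = {
    "home": frozenset({"settings_button", "search_button"}),
    "settings": frozenset({"back_button", "wifi_toggle", "bluetooth_toggle"}),
    "search": frozenset({"back_button", "search_box"}),
}


def parse_click(action: str) -> tuple[int, int]:
    """Parse an action string of the form 'click(x, y)'."""
    match = re.fullmatch(r"click\((\d+),\s*(\d+)\)", action.strip())
    if match is None:
        raise ValueError(f"unsupported action: {action!r}")
    return int(match.group(1)), int(match.group(2))


def toy_world_model(action: str) -> frozenset:
    """Stand-in for a world-prediction model: predict the UI state
    reached by applying `action` on the home screen."""
    x, y = parse_click(action)
    for screen, (x0, y0, x1, y1) in HOME_REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return SCREEN_ELEMENTS[screen]
    return SCREEN_ELEMENTS["home"]  # click on empty space: no transition


def state_similarity(a: frozenset, b: frozenset) -> float:
    """Stand-in for a semantic similarity calculator (Jaccard overlap)."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def syntax_reward(predicted: str, ground_truth: str) -> float:
    """Exact string match, as in the syntax learning paradigm."""
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0


def semantic_reward(predicted: str, ground_truth: str) -> float:
    """Compare the predicted state transitions rather than the strings."""
    return state_similarity(toy_world_model(predicted),
                            toy_world_model(ground_truth))


ground_truth = "click(50, 25)"
same_button = "click(10, 40)"   # different coordinates, same button
other_button = "click(50, 80)"  # lands on a different button

print(syntax_reward(same_button, ground_truth))     # 0.0
print(semantic_reward(same_button, ground_truth))   # 1.0
print(semantic_reward(other_button, ground_truth))  # 0.25
```

In this toy setting, the exact-match reward penalises a click that hits the same button at different coordinates, while the state-based reward treats it as equivalent and still separates it from a click that opens a different screen. A learned world model and a learned similarity measure would replace the two hand-written stand-ins in practice.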

Paper Structure

This paper contains 28 sections, 2 theorems, 18 equations, 6 figures, 11 tables.

Key Result

Theorem 3.1

Let $\pi^{\mathrm{SFT}}_{\theta}$ be an App agent trained under the syntax-learning objective of Eq. (eq:sft_ob); for an $n$-step task with ground truth actions $\{a_1^*,\dots,a_n^*\}$, define $P(\text{success})=\Pr[\text{the agent completes all $n$ steps correctly}]$ and, for any semantically prese

Figures (6)

  • Figure 1: Our ASL framework. The reward corresponds to the function defined in Eq. (eq:ori_ob).
  • Figure 2: Examples of semantically equivalent actions: (a) and (b) lead to the same GUI state, while (c) results in a different outcome.
  • Figure 3: Training curves on AitW General and Web Shopping tasks. In all cases, incorporating our semantic estimator (SEE) leads to faster convergence and higher final success rates compared with the corresponding baselines.
  • Figure 4: Training curves on the WebArena-Lite benchmark. Compared to the original baselines, incorporating our SEE helps achieve smoother optimisation dynamics and higher final task success rates, further confirming the effectiveness of semantic-level feedback in web control.
  • Figure 5: Example of a successful case on an AitW task with our SEE module.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 3.1
  • Definition 3.1
  • Theorem 3.2
  • proof
  • proof