Robust, Observable, and Evolvable Agentic Systems Engineering: A Principled Framework Validated via the Fairy GUI Agent

Jiazheng Sun, Ruimeng Yang, Xu Han, Jiayang Niu, Mingxuan Li, Te Yang, Yongyong Lu, Xin Peng

TL;DR

The paper argues that Agentic AI suffers from a lack of Software Engineering rigor, which shows up as fragility, opaque internals, and poor long-term adaptability. To address these issues, it introduces a principled SE framework consisting of Runtime Goal Refinement (RGR), Observable Cognitive Architecture (OCA), and Evolutionary Memory Architecture (EMA), and instantiates the framework in Fairy, a mobile GUI agent. On the AndroidWorld and RealMobile-Eval benchmarks, Fairy demonstrates substantial gains in task performance and maintainability, with ablation studies isolating the contribution of each principle. The work contributes a normative blueprint for constructing robust, observable, and evolvable Agentic AI systems, and discusses limitations and future paths toward broader applicability and formalization.

Abstract

The Agentic Paradigm faces a significant Software Engineering Absence, yielding Agentic systems commonly lacking robustness, observability, and evolvability. To address these deficiencies, we propose a principled engineering framework comprising Runtime Goal Refinement (RGR), Observable Cognitive Architecture (OCA), and Evolutionary Memory Architecture (EMA). In this framework, RGR ensures robustness and intent alignment via knowledge-constrained refinement and human-in-the-loop clarification; OCA builds an observable and maintainable white-box architecture using component decoupling, logic layering, and state-control separation; and EMA employs an execution-evolution dual-loop for evolvability. We implemented and empirically validated Fairy, a mobile GUI agent based on this framework. On RealMobile-Eval, our novel benchmark for ambiguous and complex tasks, Fairy outperformed the best SoTA baseline in user requirement completion by 33.7%. Subsequent controlled experiments, human-subject studies, and ablation studies further confirmed that the RGR enhances refinement accuracy and prevents intent deviation; the OCA improves maintainability; and the EMA is crucial for long-term performance. This research provides empirically validated specifications and a practical blueprint for building reliable, observable, and evolvable Agentic AI systems.

Paper Structure

This paper contains 84 sections, 10 equations, 14 figures, 11 tables.

Figures (14)

  • Figure 1: RGR-I (Runtime Refinement) vs. Traditional KAOS (Design-Time Refinement). Traditional software (Top) relies on requirements analysts to map user requirements to system modules at design-time, which renders the system unable to dynamically adapt to user needs; for example, a system targeted at purchasing meal items (e.g., via a MealCatalogService) will fail when receiving a request for a gift card. RGR-I right-shifts requirements refinement for Agentic software (Bottom) to runtime, requiring the planning engine to perform runtime refinement based on Task Knowledge. This allows it to generate new, correct sub-goals and adapt to diverse user needs.
  • Figure 2: RGR-II (Collaborative Refinement) vs. the Traditional ReAct Paradigm (Blind Refinement). In real-world scenarios, user instructions are typically ambiguous. When the environment provides multiple feasible options (e.g., multiple hamburger options appearing on the screen that all satisfy the "buy a hamburger" request), paradigms like ReAct, being task-completion oriented, will blindly refine the user requirement and unilaterally select any hamburger, leading to a failure to meet the user's needs. RGR-II identifies this (an OR-decomposition) as a Runtime Expectation, pauses autonomous execution, and collaboratively refines with the user ("Ask the user what they would like to eat"), ensuring the execution trajectory remains aligned with the user's true intent.
  • Figure 3: An architectural diagram of OCA-I (Logical Layering and Cognitive Delegation). Attempting to use a single RGR planning engine to directly decompose a high-level user intent into atomic actions is difficult (Hard), and it is also challenging to provide the appropriate Knowledge. OCA-I recommends adopting the Logical Layering and Cognitive Delegation architecture, which decomposes the task layer by layer through several RGR engines. For example: RGR Engine I is responsible for using the List of...installed apps to decompose the user intent into App-Level Tasks; RGR Engine II is responsible for refining the app-level task into App-Functional Goals; RGR Engine III is responsible for utilizing "The current phone screen" information to convert the goal into final Atomic Actions.
  • Figure 4: OCA-II (State-Control Separation) vs. Traditional Multi-Agent Black-Box Coupling. In a traditional multi-agent architecture (Top), components directly call each other, leading to severe coupling between components, which readily becomes an unobservable and difficult-to-debug black box. OCA-II achieves a white box through Cognitive Decoupling and State-Control Separation: the data flow and control flow of each component are managed by the Memory Bus (MB) and Event Bus (EB) respectively, while all components adhere to a deterministic coordination protocol, thereby ensuring system observability.
  • Figure 5: EMA (Evolutionary Memory Architecture)'s Hierarchical Memory Structure and Execution-Evolution Dual-Loop Model. EMA-I defines how knowledge is stored: Working Memory is a hierarchical Cognitive Stack, where agents in the Execution Loop write status to the stack top or read status from it. Long-Term Memory is divided into Procedural and Semantic knowledge. It is queried by agents during the Execution Loop at runtime, and updated by the Evolution Loop after the Execution Loop concludes, utilizing the complete trajectory from the Cognitive Stack. EMA-II further addresses how knowledge is used and where it comes from: The Execution Loop corrects tactical deviations caused by a non-deterministic environment via a runtime reflection mechanism; the Evolution Loop consolidates raw experiences from Working Memory into reusable capabilities within Long-Term Memory.
  • ...and 9 more figures
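
The state-control separation described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not Fairy's implementation: the class names `MemoryBus` and `EventBus` follow the caption's terminology, but their methods, the event names, and the planner and executor components are all assumptions made for illustration. The point it tries to show is that components never call each other directly, so every state write and every control transfer leaves a trace that can be inspected.

```python
from collections import defaultdict
from typing import Any, Callable


class MemoryBus:
    """Hypothetical shared state store: all data flow goes through here."""

    def __init__(self) -> None:
        self._state: dict[str, Any] = {}
        self.log: list[tuple[str, str, Any]] = []  # (writer, key, value)

    def write(self, writer: str, key: str, value: Any) -> None:
        self._state[key] = value
        self.log.append((writer, key, value))  # every write is recorded

    def read(self, key: str) -> Any:
        return self._state[key]


class EventBus:
    """Hypothetical control channel: all control flow goes through here."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[], None]]] = defaultdict(list)
        self.log: list[str] = []  # ordered record of published events

    def subscribe(self, event: str, handler: Callable[[], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str) -> None:
        self.log.append(event)
        for handler in list(self._handlers[event]):
            handler()


def wire_planner(mb: MemoryBus, eb: EventBus) -> None:
    # Illustrative component: reads the task, writes a plan, signals readiness.
    def on_task() -> None:
        task = mb.read("task")
        mb.write("planner", "plan", [f"open app for: {task}", "tap confirm"])
        eb.publish("plan_ready")

    eb.subscribe("task_received", on_task)


def wire_executor(mb: MemoryBus, eb: EventBus) -> None:
    # Illustrative component: knows nothing about the planner, only the buses.
    def on_plan() -> None:
        plan = mb.read("plan")
        mb.write("executor", "result", f"executed {len(plan)} steps")
        eb.publish("task_done")

    eb.subscribe("plan_ready", on_plan)


def run(task: str) -> tuple[MemoryBus, EventBus]:
    mb, eb = MemoryBus(), EventBus()
    wire_planner(mb, eb)
    wire_executor(mb, eb)
    mb.write("user", "task", task)
    eb.publish("task_received")
    return mb, eb
```

After a call such as `mb, eb = run("buy a hamburger")`, `eb.log` holds the ordered control trace and `mb.log` the ordered state writes, which is the kind of observability the caption attributes to routing data and control through separate buses.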