OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim

TL;DR

OpenApps introduces a scalable, open-source framework to quantify UI-agent reliability across thousands of configurable app variants, addressing a blind spot in prior fixed-environment benchmarks. By exposing full app state and using deterministic rewards based on ground-truth targets, OpenApps enables robust, reproducible evaluation of how appearance and content variations affect task success and agent behavior. Across seven multimodal agents, the study finds that reliability degrades substantially when app appearance and content are varied, with task success fluctuating by more than 50% and behaviors such as looping and hallucinating actions emerging under certain configurations. These findings highlight the need to evaluate UI-agents along the dimension of app variation for trustworthy deployment, and they provide a foundation for safe training, stress testing, and generalization research in agentic UI systems.

Abstract

Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect their ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a lightweight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than $50\%$ across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from $63\%$ to just $4\%$ across app versions. We also find that agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/

Paper Structure

This paper contains 50 sections, 4 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: OpenApps can generate thousands of configurable versions of apps. OpenApps contains six apps covering common digital tasks, with configurable appearance and app data, for measuring a new dimension of reliability: reliability across the app variations agents are likely to encounter. OpenApps can be deployed anywhere Python can run with a single CPU (without specialized hardware, emulators, or setup). In the right panel, average success rates for the same tasks and agents fluctuate across app versions, suggesting that app variation is a key axis of reliability.
  • Figure 2: Screenshots with example appearance variations of all six apps in OpenApps: OpenToDo, OpenCalendar, OpenMessenger, OpenMaps, OpenCodeEditor, and OpenShop. Each app is a fully functional Python application with editable state and appearance. OpenApps can be configured via simple YAML files.
  • Figure 3: We organize agents' interactions with OpenApps using the standard terminology of reinforcement learning. The environment state $s_t$ is defined by design and content variables, and initialized from a YAML specification. At each step $t$, the agent receives observations $o_t$ (screenshot and accessibility tree) and issues an action $a_t \sim \pi(a_t \mid \{a_{i-1}, o_i\}_{i=0}^{t})$ through BrowserGym API calls. Task success is evaluated using the underlying app state $s_t$.
  • Figure 4: Agents are sensitive to app variations. We show that the average success rate over all tasks differs across app variations. Each bar represents the average success rate over all tasks within a single app version (with three random seeds per task). Note that we filter out tasks where an agent has a success rate of 0 across all app variations. Success rates can differ considerably across app versions.
  • Figure 5: Reliability within fixed app versions underestimates fluctuations in performance. In the middle of each bar we show the standard deviation of task success. We compare two settings: the deviation within a fixed app version, and the overall deviation, which also accounts for differences in agent success rates across app variations.
  • ...and 5 more figures
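
The quantities described in the captions for Figures 4 and 5 can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the version names and 0/1 outcomes below are made up, and both the function names and the exact definitions (mean of per-version standard deviations versus the standard deviation of the pooled outcomes) are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Hypothetical data: app version -> 0/1 task outcomes (task x seed), made up for illustration.
outcomes = {
    "default": [1, 1, 0, 1, 1, 0],
    "dark_theme": [1, 0, 0, 1, 0, 0],
    "altered_content": [0, 0, 0, 1, 0, 0],
}


def per_version_success(outcomes):
    """Average success rate within each app version (one bar per version, as in Figure 4)."""
    return {version: mean(results) for version, results in outcomes.items()}


def within_version_std(outcomes):
    """Assumed definition: mean of the standard deviations computed inside each fixed version."""
    return mean(pstdev(results) for results in outcomes.values())


def overall_std(outcomes):
    """Assumed definition: standard deviation over all outcomes pooled across versions,
    which also reflects differences in mean success between versions."""
    pooled = [x for results in outcomes.values() for x in results]
    return pstdev(pooled)


rates = per_version_success(outcomes)
spread = max(rates.values()) - min(rates.values())
print("per-version success:", rates)
print("spread across versions:", spread)
print("within-version std:", within_version_std(outcomes))
print("overall std:", overall_std(outcomes))
```

With equally sized groups, the pooled variance equals the mean within-version variance plus the variance of the per-version means, so the overall deviation in this sketch is never smaller than the within-version one. This is one way to read the claim that fixed-version reliability underestimates fluctuations.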