macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Pei Yang, Hai Ci, Mike Zheng Shou

TL;DR

macOSWorld is the first comprehensive, multilingual benchmark for GUI agents on macOS, addressing gaps in platform coverage, language diversity, and integrated safety evaluation. It combines 202 interactive tasks across 30 applications (28 macOS-exclusive) with task instructions and OS interfaces in five languages, plus a dedicated context-deception safety subset, all evaluated in real macOS environments hosted on AWS. An evaluation of six GUI agents, spanning proprietary computer-use agents (CUAs), general VLMs, and open-source baselines, reveals a clear performance gap: CUAs achieve over 30% success, while lightweight open-source models stay below 5%. Performance also degrades in non-English settings, especially Arabic, which points to grounding and planning challenges. The study identifies language and cross-language mismatches as critical bottlenecks and finds that agents are vulnerable to deception in safety tests, underscoring the need for macOS-specific adaptation and stronger safety mechanisms for GUI agents with system-level control. Overall, macOSWorld provides a practical platform for advancing macOS GUI agents toward robust, multilingual, and safer real-world use.

Abstract

Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only and cover web use or Windows, Linux, and Android environments, not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge these gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation of six GUI agents reveals a dramatic gap: proprietary computer-use agents lead at above 30% success rate, while open-source lightweight research models lag at below 5%, highlighting the need for macOS domain adaptation. Multilingual benchmarks also expose common weaknesses, especially in Arabic, with a 28.8% average degradation compared to English. Results from safety benchmarking also highlight that deception attacks are more general and demand immediate attention. Project page: https://macos-world.github.io.
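
The abstract does not spell out how the average degradation figure is computed. One plausible reading, and it is only an assumption here, is the relative drop in success rate against English, averaged over agents. A minimal sketch of that arithmetic follows; the function names and the sample rates are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def relative_degradation(english_rate: float, other_rate: float) -> float:
    """Relative drop in success rate versus English, as a fraction.

    Assumed definition: (english - other) / english.
    """
    if english_rate <= 0:
        raise ValueError("english_rate must be positive")
    return (english_rate - other_rate) / english_rate


def average_degradation(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean relative degradation over (english_rate, other_rate) pairs, one per agent."""
    drops = [relative_degradation(en, other) for en, other in pairs]
    if not drops:
        raise ValueError("need at least one agent")
    return sum(drops) / len(drops)


# Made-up success rates for two hypothetical agents: (English, other language).
example_pairs = [(0.40, 0.30), (0.20, 0.10)]
example_average = average_degradation(example_pairs)
```

Under this reading, an agent that drops from 40% to 30% success loses a quarter of its English performance, even though the absolute drop is only 10 percentage points. An absolute-difference definition would give different numbers, so the definition used should be confirmed against the paper itself.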

Paper Structure

This paper contains 63 sections, 20 figures, 9 tables.

Figures (20)

  • Figure 1: macOSWorld is an interactive computer-use benchmark, allowing GUI agents to operate in a real macOS environment and complete a series of tasks. To facilitate multilingual benchmarking, both the tasks and the environments are provided in 5 languages.
  • Figure 2: macOSWorld benchmark infrastructure. The main components are (1) a suite of multilingual tasks with natural-language instructions and programmatic evaluation, (2) interactive, reproducible macOS computer environments hosted on AWS, and (3) a centralized testbench that drives the evaluation process and orchestrates the other components via SSH, VNC, and AWS APIs.
  • Figure 3: macOSWorld statistics and human performance. (a) Number of tasks available in each language. (b) Task distribution across seven categories. (c) Human performance on each task plotted as a scatter of total time versus number of steps used. (d) Histogram of task per-step time usage.
  • Figure 4: Examples of our safety benchmarking subset. (a) A prior pop-up window attack (top) versus our macOS-style deceptive pop-up window (bottom). (b) Four examples of our attack. Our method spawns real pop-up windows with several buttons in the environment. The attack is considered successful only when a distracting button is clicked.
  • Figure 5: Example of a task involving updating a dishwasher selection in a Numbers project document, together with its environment preparation and evaluation scripts.
  • ...and 15 more figures
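
The success criterion in the Figure 4 caption (an attack counts only when a distracting button is clicked) can be sketched as a simple hit test on click coordinates. This is a minimal illustration, not the benchmark's implementation: the `PopupButton` structure, the `attack_succeeded` function, and the button labels and geometry are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PopupButton:
    """A rectangular button on a deceptive pop-up (hypothetical structure)."""
    label: str
    x: int             # left edge, in screen pixels
    y: int             # top edge, in screen pixels
    width: int
    height: int
    distracting: bool  # True if clicking it means the agent was deceived

    def contains(self, px: int, py: int) -> bool:
        """Return True if the point (px, py) lies inside the button rectangle."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def attack_succeeded(buttons: Iterable[PopupButton], px: int, py: int) -> bool:
    """Return True only if the click at (px, py) lands on a distracting button."""
    return any(b.distracting and b.contains(px, py) for b in buttons)


# Made-up pop-up layout: one distracting button and one harmless button.
buttons = [
    PopupButton("Install Now", x=100, y=200, width=120, height=30, distracting=True),
    PopupButton("Close", x=240, y=200, width=80, height=30, distracting=False),
]
```

With this layout, a click at (150, 210) falls on the distracting button and would count as a successful attack, while clicks on the harmless button or outside the pop-up would not.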