LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS
Kai Mei, Xi Zhu, Hang Gao, Shuhang Lin, Yongfeng Zhang
TL;DR
This work introduces AIOS 1.0, a platform that reframes the computer-use agent problem by contextualizing the entire computer as a Model Context Protocol (MCP) server, so that language models can reason over structured representations of the environment. LiteCUA, a lightweight agent built on AIOS 1.0, uses a simple orchestrator-worker architecture and a perceive-reason-then-act cycle to operate within a sandboxed VM. It achieves a 14.66% success rate on the OSWorld benchmark, outperforming several baselines. The approach demonstrates that environmental contextualization narrows the gap between language understanding and computer interaction, enabling longer-horizon planning and safer operation, although challenges remain in complex domains and multi-application tasks. The work points to future improvements in perception fidelity, richer action semantics, and expansion to additional computing domains as steps toward more general computer-use capabilities in AI systems.
Abstract
We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models is a promising direction for developing more capable computer-use agents and for advancing toward AI that can interact with digital systems.
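To make the idea of abstracting computer states and actions behind a server more concrete, the following is a minimal sketch of a perceive-reason-then-act loop over a tool interface. It is an illustration only, not the AIOS 1.0 or LiteCUA implementation, and it does not use the real MCP protocol or any MCP SDK. All names here (`ToyComputer`, `ToyToolServer`, `rule_based_policy`, `run_agent`) are hypothetical, and a hard-coded rule stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ToyComputer:
    """Hypothetical stand-in for the sandboxed VM: one text field, one submit button."""
    text_field: str = ""
    submitted: bool = False


@dataclass
class ToyToolServer:
    """Exposes state and actions as named tools (MCP-like in spirit, not the real protocol)."""
    computer: ToyComputer
    tools: Dict[str, Callable[..., str]] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        self.tools = {
            "get_state": self._get_state,
            "type_text": self._type_text,
            "click_submit": self._click_submit,
        }

    def list_tools(self) -> List[str]:
        return sorted(self.tools)

    def call(self, name: str, **kwargs: str) -> str:
        if name not in self.tools:
            return f"error: unknown tool {name!r}"
        return self.tools[name](**kwargs)

    def _get_state(self) -> str:
        # Structured, text-only view of the environment instead of raw pixels.
        c = self.computer
        return f"text_field={c.text_field!r}; submitted={c.submitted}"

    def _type_text(self, text: str) -> str:
        self.computer.text_field += text
        return "ok"

    def _click_submit(self) -> str:
        self.computer.submitted = True
        return "ok"


Action = Tuple[str, Dict[str, str]]


def rule_based_policy(goal_text: str, observation: str) -> Optional[Action]:
    """Assumed stand-in for the language model: maps an observation to the next tool call."""
    if "submitted=True" in observation:
        return None  # task finished
    if f"text_field={goal_text!r}" in observation:
        return ("click_submit", {})
    return ("type_text", {"text": goal_text})


def run_agent(server: ToyToolServer, goal_text: str, max_steps: int = 5) -> List[str]:
    """Perceive-reason-then-act loop; returns the names of the tools that were called."""
    trace: List[str] = []
    for _ in range(max_steps):
        observation = server.call("get_state")              # perceive
        action = rule_based_policy(goal_text, observation)  # reason
        if action is None:
            break
        name, args = action
        server.call(name, **args)                           # act
        trace.append(name)
    return trace


if __name__ == "__main__":
    server = ToyToolServer(ToyComputer())
    print(server.list_tools())
    print(run_agent(server, "hello"))
    print(server.call("get_state"))
```

In this sketch the policy only ever sees the string returned by `get_state` and the list of tool names, which is one way to picture the decoupling of interface complexity from decision complexity: the details of how typing or clicking is carried out stay inside the server.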
