Large Action Models: From Inception to Implementation
Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
TL;DR
This work advocates Large Action Models (LAMs), which go beyond text-only responses to generate and execute actions in real-world and digital environments. It presents a generalizable pipeline, demonstrated on Windows, that runs from data collection (task-plan and task-action data) through four training phases (planning, expert imitation, self-exploration, and reward-based refinement) to offline and online evaluation within a GUI agent (UFO). The key contributions are a structured data workflow, the four-phase training regime, and empirical results showing higher task success rates and better efficiency than strong baselines. The framework also covers practical considerations for grounding, memory, safety, and deployment, and outlines directions for scalability and industrial adoption. Overall, the paper shows how LAMs can bridge the gap between understanding user intent and autonomously executing complex tasks in dynamic environments.
Abstract
As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications. The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.
