Position: Foundation Agents as the Paradigm Shift for Decision Making
Xiaoqian Liu, Xingzhou Lou, Jianbin Jiao, Junge Zhang
TL;DR
This paper argues that foundation agents (capable, multi-modal decision-makers trained on large-scale interactive data and aligned with large language models, or LLMs) will redefine how agents learn, generalize, and operate across physical and virtual environments. It outlines a roadmap that runs from data collection, through self-supervised pretraining, to alignment with LLMs, emphasizing unified representations, policy interfaces, and open-ended decision-making. The work surveys learning from large-scale demonstrations and simulators, discusses self-supervised pretraining objectives and downstream adaptation, and argues for memory, planning, and action modules augmented by LLM reasoning and external tools. Its contribution is a clearer research agenda for theory, architecture, and real-world deployment, one that highlights both the promise and the safety considerations of foundation agents in complex decision-making tasks.
Abstract
Decision making demands an intricate interplay between perception, memory, and reasoning to discern optimal policies. Conventional approaches to decision making suffer from low sample efficiency and poor generalization. In contrast, foundation models in language and vision have shown rapid adaptation to diverse new tasks. We therefore advocate the construction of foundation agents as a transformative shift in the learning paradigm of agents. The proposal rests on a formulation of foundation agents, together with their fundamental characteristics and challenges, that is motivated by the success of large language models (LLMs). We then specify a roadmap for foundation agents that runs from large-scale interactive data collection or generation, through self-supervised pretraining and adaptation, to knowledge and value alignment with LLMs. Lastly, we pinpoint critical research questions derived from the formulation and delineate trends for foundation agents supported by real-world use cases, addressing both technical and theoretical aspects to propel the field forward.
