Position: Foundation Agents as the Paradigm Shift for Decision Making

Xiaoqian Liu, Xingzhou Lou, Jianbin Jiao, Junge Zhang

TL;DR

This paper argues that foundation agents—capable, multi-modal decision-makers trained with large-scale interactive data and aligned with large language models—will redefine how agents learn, generalize, and operate across physical and virtual environments. It outlines a roadmap from data collection to self-supervised pretraining and alignment with LLMs, emphasizing unified representations, policy interfaces, and open-ended decision-making. The work surveys learning from large-scale demonstrations and simulators, discusses self-supervised pretraining objectives and downstream adaptation, and argues for memory, planning, and action modules augmented by LLM reasoning and external tools. Its contribution lies in clarifying a research agenda for theory, architecture, and real-world deployment, highlighting both the promise and the safety considerations of foundation agents in complex decision-making tasks.

Abstract

Decision making demands intricate interplay between perception, memory, and reasoning to discern optimal policies. Conventional approaches to decision making face challenges related to low sample efficiency and poor generalization. In contrast, foundation models in language and vision have showcased rapid adaptation to diverse new tasks. Therefore, we advocate for the construction of foundation agents as a transformative shift in the learning paradigm of agents. This proposal is underpinned by the formulation of foundation agents with their fundamental characteristics and challenges motivated by the success of large language models (LLMs). Moreover, we specify the roadmap of foundation agents from large interactive data collection or generation, to self-supervised pretraining and adaptation, and knowledge and value alignment with LLMs. Lastly, we pinpoint critical research questions derived from the formulation and delineate trends for foundation agents supported by real-world use cases, addressing both technical and theoretical aspects to propel the field towards a more comprehensive and impactful future.

Paper Structure (29 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Roadmap of foundation agents. Left: Large interactive data collection or generation scales up foundation agents in open-ended physical and virtual worlds. Middle: Self-supervised pretraining leverages the flexible and robust Transformer architecture [vaswani2017attention] based on autoregressive or masked (in gray) modeling. During adaptation, the pretrained model plays various roles in decision making, such as policy initialization [meng2021offline; yang2021representation] and dynamics model [wu2023masked; brandfonbrener2023inverse]. Distinct colors denote different variables within trajectories. Right: LLMs serve as part of foundation agents composed of planning, memory, and action modules, acting as world models, information processors, and decision-makers, respectively.
  • Figure 2: Visualized fine-tuning performance of our pretrained agent on an unseen task and game. Frames are selected from the video recorded in the last evaluation episode. The total number of evaluation episodes is 50. Results are from one training seed.