Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches

Alhassan Mumuni, Fuseini Mumuni

TL;DR

The paper surveys foundational principles—embodiment, symbol grounding, causality, and memory—as essential components for achieving artificial general intelligence (AGI) with large language models and multimodal foundation models. It analyzes current state-of-the-art approaches, including embodied agents, knowledge graphs, ontology-driven prompting, RAG, physics-informed world models, and neuro-symbolic grounding, highlighting their roles and limitations. A holistic AGI framework is proposed that interconnects embodiment, grounding, causality, and memory, illustrating how their integration can enable robust, generalizable intelligent agents. The discussion emphasizes the need for unified design, scalable data, and interactive environments to advance toward human-level general intelligence.

Abstract

Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs) such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models have demonstrated the ability to solve complex, non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, which allows them to form rich and nuanced representations of the world. This gives them extensive capabilities, including the ability to reason; engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite these impressive capabilities, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems (embodiment, symbol grounding, causality and memory) must be addressed for LLMs to attain human-level general intelligence. These concepts are more closely aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss these foundational issues and survey state-of-the-art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.

Paper Structure (58 sections, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Some of the most important features of artificial general intelligence (AGI) systems. These features give AGI systems vast cognitive capabilities despite the models' limited knowledge and the need to take shortcuts in cognitive information processing in order to conserve energy and time.
  • Figure 2: LLM versus human intelligence: Important mechanisms that allow flexible extension of knowledge and cognitive abilities.
  • Figure 3: A summary of the essence and role of each of the foundational AGI concepts covered in this work.
  • Figure 4: In this scene, two intelligent agents A and B assist during an emergency. When driven by high-level goals that are aligned with human interests and values, such agents can perform good acts spontaneously. Goal-awareness allows them to be proactive, autonomous and capable of attending to multiple tasks without deviating from their main essence.
  • Figure 5: A simplified representation of EmbodiedGPT [mu2024embodiedgpt]. The framework uses a large-scale egocentric dataset, EgoCOT (curated as part of the work), to teach agents a wide range of embodied skills, including video captioning, visual question answering, multi-turn dialog, and navigation and object manipulation in the physical world. It consists of four integrated components: (a) a vision transformer that encodes visual information from observations; (b) a custom submodule, called Embodied-Former, that maps input text and images (i.e., embodied instructions and visual information) to features relevant for embodied high-level planning and low-level control tasks; (c) a large language model that performs language-related tasks (e.g., image captioning, planning and embodied question answering); (d) a policy network that generates low-level actions from the features learned by the Embodied-Former submodule. These actions allow the agent to physically interact with the real world using its actuators. A chain-of-thought approach is used to generate task-relevant goals from prompts.
  • ...and 11 more figures
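
The data flow among the four components named in the Figure 5 caption can be illustrated with a toy sketch. This is not the EmbodiedGPT implementation: every function, class and numeric rule below (`vision_encoder`, `embodied_former`, `language_model_plan`, `policy_network`, `Observation`, `run_agent`) is a hypothetical stand-in chosen only to show how the outputs of one stage might feed the next. In particular, the "plan" is produced by simple string splitting as a placeholder for chain-of-thought planning by a real language model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    """Hypothetical input: a flattened image stand-in plus a text instruction."""
    pixels: List[float]
    instruction: str


def vision_encoder(pixels: List[float]) -> List[float]:
    # (a) Stand-in for the vision transformer: summarize the image
    # as two toy features (mean and max intensity).
    return [sum(pixels) / len(pixels), max(pixels)]


def embodied_former(visual: List[float], instruction: str) -> List[float]:
    # (b) Stand-in for Embodied-Former: fuse visual features with a
    # toy text feature (the instruction's word count).
    return visual + [float(len(instruction.split()))]


def language_model_plan(instruction: str) -> List[str]:
    # (c) Stand-in for the language model: split the instruction into
    # ordered sub-goals, imitating step-by-step (chain-of-thought) planning.
    parts = [p.strip() for p in instruction.split(" then ") if p.strip()]
    return [f"step {i + 1}: {p}" for i, p in enumerate(parts)]


def policy_network(features: List[float], plan: List[str]) -> List[float]:
    # (d) Stand-in for the policy network: emit one toy low-level
    # action value per plan step, derived from the fused features.
    gain = sum(features) / len(features)
    return [round(gain * (i + 1), 3) for i in range(len(plan))]


def run_agent(obs: Observation) -> Tuple[List[str], List[float]]:
    visual = vision_encoder(obs.pixels)
    fused = embodied_former(visual, obs.instruction)
    plan = language_model_plan(obs.instruction)
    actions = policy_network(fused, plan)
    return plan, actions


if __name__ == "__main__":
    obs = Observation(pixels=[0.0, 0.5, 1.0],
                      instruction="pick up the cup then place it on the shelf")
    plan, actions = run_agent(obs)
    for step, action in zip(plan, actions):
        print(step, "->", action)
```

The only point of the sketch is the ordering: perception and instruction are fused before control, and the high-level plan conditions the low-level actions. A real system would replace each stub with a learned model.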