Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches

Alhassan Mumuni; Fuseini Mumuni

TL;DR

The paper surveys foundational principles—embodiment, symbol grounding, causality, and memory—as essential components for achieving artificial general intelligence (AGI) with large language models and multimodal foundation models. It analyzes current state-of-the-art approaches, including embodied agents, knowledge graphs, ontology-driven prompting, RAG, physics-informed world models, and neuro-symbolic grounding, highlighting their roles and limitations. A holistic AGI framework is proposed that interconnects embodiment, grounding, causality, and memory, illustrating how their integration can enable robust, generalizable intelligent agents. The discussion emphasizes the need for unified design, scalable data, and interactive environments to advance toward human-level general intelligence.

Abstract

Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs), such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models, have demonstrated the ability to solve complex, non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, allowing rich and nuanced representations of the world and thereby providing extensive capabilities, including the ability to reason; engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems (embodiment, symbol grounding, causality and memory) must be addressed for LLMs to attain human-level general intelligence. These concepts are more aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss the aforementioned foundational issues and survey state-of-the-art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.

Paper Structure

This paper contains 58 sections, 16 figures, 3 tables.

Introduction
Background
Language as the foundation of general intelligence in biological systems
Language as a medium of knowledge acquisition, representation and organization
Language as a tool for cognitive information processing
The concept of artificial general intelligence
Scope and outline of work
Towards artificial general intelligence with large language models
Large language models and artificial general intelligence
Features of large language models that support the attainment of human-level intelligence
Overview of foundational principles for AGI with LLMs
Embodiment
Basic concept of embodiment
Embodiment as the foundation of general intelligence
Key aspects of embodied intelligence
...and 43 more sections

Figures (16)

Figure 1: Some of the most important features of artificial general intelligence (AGI) systems. These features give AGI systems vast cognitive capabilities despite the models' limited knowledge and their need to take shortcuts in cognitive information processing in order to conserve energy and time.
Figure 2: LLM versus human intelligence: Important mechanisms that allow flexible extension of knowledge and cognitive abilities.
Figure 3: A summary of the essence and role of each of the foundational AGI concepts covered in this work.
Figure 4: In this scene, two intelligent agents A and B assist during an emergency. When driven by high-level goals that are aligned with human interests and values, such agents can perform good acts spontaneously. Goal-awareness allows them to be proactive, autonomous and capable of attending to multiple tasks without deviating from their main essence.
Figure 5: A simplified representation of EmbodiedGPT [mu2024embodiedgpt]. The framework uses a large-scale egocentric dataset, EgoCOT (curated as part of that work), to teach agents a wide range of embodied skills, including video captioning, visual question answering, multi-turn dialog, and navigation and object manipulation in the physical world. It consists of four integrated components: (a) a vision transformer that encodes visual information from observations; (b) a custom submodule, named Embodied-Former, that maps input text and images (i.e., embodied instructions and visual information) to features relevant for embodied high-level planning and low-level control tasks; (c) a large language model that performs language-related tasks (e.g., image captioning, planning and embodied question answering); (d) a policy network that generates low-level actions from the features learned by the Embodied-Former submodule. These actions allow the agent to physically interact with the real world using its actuators. A chain-of-thought approach is used to generate task-relevant goals from prompts.
...and 11 more figures
