A Roadmap for Embodied and Social Grounding in LLMs
Sara Incao, Carlo Mazzola, Giulia Belgiovine, Alessandra Sciutti
TL;DR
The paper addresses the problem that language models alone cannot ground meaning in physical and social environments. It proposes a roadmap for LLM grounding centered on three pillars: an active bodily system for embodied experience, temporally structured interaction via Predictive Processing and active inference, and social skills to establish common ground. It surveys the state of the art, identifies gaps in multimodal, embodiment-aware grounding, and discusses how body, time, and social interaction together enable more robust human-robot interaction. The work provides a theoretical framework and design considerations to guide the development of LLM-enabled embodied robots capable of meaningful perception-action loops and shared understanding with humans, with the aim of increasing the practical impact of robotics and AI in real-world settings.
Abstract
The fusion of Large Language Models (LLMs) and robotic systems has led to a transformative paradigm in the field of robotics, offering unparalleled capabilities not only in the communication domain but also in skills like multimodal input handling, high-level reasoning, and plan generation. The grounding of LLMs' knowledge in the empirical world has been considered a crucial pathway to exploit the efficiency of LLMs in robotics. Nevertheless, connecting LLMs' representations to the external world with multimodal approaches or with robots' bodies is not enough to let them understand the meaning of the language they are manipulating. Taking inspiration from humans, this work draws attention to three necessary elements for an agent to grasp and experience the world. The roadmap for LLM grounding is envisaged in an active bodily system as the reference point for experiencing the environment, a temporally structured experience for a coherent, self-related interaction with the external world, and social skills to acquire a common-grounded shared experience.
