Fundamentals of Building Autonomous LLM Agents
Victor de Lamo Castrillo, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll
TL;DR
The paper surveys how to build autonomous LLM agents by integrating perception, memory, reasoning, planning, and execution into a modular architecture. It covers reasoning methods such as Chain-of-Thought and Tree-of-Thought, along with DPPM-style task decomposition and reflection, and it emphasizes the role of expert ensembles in scaling reasoning. It reviews perception modalities (text, multimodal, structured data, and tool-based), memory strategies (RAG and long-term storage), and multimodal execution (UI automation, code, and robotics), and it notes open challenges such as GUI grounding, latency, and context-window limits. The work argues that modular, memory-aware LLM agents are practical for complex real-world automation and decision-making, and it points to one-shot learning and human-in-the-loop setups as directions for future improvement.
Abstract
This paper reviews the architecture and implementation methods of agents powered by large language models (LLMs). Motivated by the limitations of traditional LLMs in real-world tasks, it explores design patterns for building "agentic" LLMs that can automate complex tasks and narrow the performance gap with humans. Key components include a perception system that converts environmental percepts into meaningful representations; a reasoning system that formulates plans, adapts to feedback, and evaluates actions using techniques such as Chain-of-Thought and Tree-of-Thought; a memory system that retains knowledge through both short-term and long-term mechanisms; and an execution system that translates internal decisions into concrete actions. The paper shows how integrating these systems leads to more capable and more general software bots that mimic human cognitive processes to act autonomously and intelligently.
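
The four subsystems named in the abstract can be pictured as one loop: perceive, remember, reason, execute. The Python sketch below is only an illustration of that division of labor, not the authors' implementation. Every name in it (Memory, perceive, reason, execute, agent_step, and the tool names) is hypothetical, and the reason function is a plain lookup standing in for an LLM call that would normally apply Chain-of-Thought or similar prompting.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Hypothetical memory system with short-term and long-term stores."""
    short_term: list[str] = field(default_factory=list)      # recent observations
    long_term: dict[str, str] = field(default_factory=dict)  # persisted knowledge

    def remember(self, observation: str, limit: int = 3) -> None:
        # Keep only the most recent `limit` observations, a rough
        # stand-in for a bounded context window.
        self.short_term.append(observation)
        del self.short_term[:-limit]


def perceive(raw: str) -> str:
    """Perception: turn a raw percept into a normalized representation."""
    return raw.strip().lower()


def reason(observation: str, memory: Memory) -> str:
    """Reasoning: choose an action name.

    Assumption: a dictionary lookup replaces the LLM call here.
    """
    return memory.long_term.get(observation, "ask_for_help")


def execute(action: str, tools: dict[str, Callable[[], str]]) -> str:
    """Execution: translate the chosen action into a concrete tool call."""
    tool = tools.get(action)
    return tool() if tool else f"unknown action: {action}"


def agent_step(raw: str, memory: Memory,
               tools: dict[str, Callable[[], str]]) -> str:
    """One pass through perceive -> remember -> reason -> execute."""
    observation = perceive(raw)
    memory.remember(observation)
    action = reason(observation, memory)
    return execute(action, tools)


if __name__ == "__main__":
    memory = Memory(long_term={"login page shown": "type_credentials"})
    tools = {
        "type_credentials": lambda: "credentials typed",
        "ask_for_help": lambda: "asked a human",
    }
    print(agent_step("  Login page shown ", memory, tools))
    print(agent_step("unexpected popup", memory, tools))
```

Keeping each subsystem behind its own function is the point of the sketch: the lookup in reason could be replaced by a model call, or the dictionary in Memory by a retrieval store, without touching the other three parts.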
