From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users
Sadia Sultana Chowa, Riasad Alvi, Subhey Sadi Rahman, Md Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam
TL;DR
The paper surveys recent advances (2023–2025) in LLMs as autonomous agents and tool users, proposing a comprehensive taxonomy across architectures, tool integration, cognition, prompting, and evaluation. It synthesizes evidence on single- and multi-agent frameworks, reasoning, planning, and memory, and analyzes how prompting, fine-tuning, and memory augmentation affect agent autonomy and grounding. A key contribution is a critical assessment of current benchmarks and 68 public datasets, highlighting gaps in verifiable reasoning, self-improvement, and personalization, and outlining ten future directions. The findings underscore that external tool access, structured memory, and hybrid prompting/fine-tuning strategies are central to scalable, safe, and effective agent systems with broad domain impact (healthcare, biology, engineering, robotics). Overall, the review provides a foundation for advancing robust, interpretable, and human-centered LLM-based agents.
Abstract
The pursuit of human-level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely used as decision-making agents because of their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and is organized around seven research questions. We considered only papers published between 2023 and 2025 in A*- and A-ranked conferences and Q1 journals. We present a structured analysis of the architectural design principles of LLM agents, dividing their applications into single-agent and multi-agent systems, and of strategies for integrating external tools. In addition, we investigate the cognitive mechanisms of LLMs, including reasoning, planning, and memory, and the impact of prompting methods and fine-tuning procedures on agent performance. Furthermore, we evaluate current benchmarks and assessment protocols and analyze 68 publicly available datasets used to assess the performance of LLM-based agents across various tasks. In conducting this review, we identified critical gaps in the verifiable reasoning of LLMs, their capacity for self-improvement, and the personalization of LLM-based agents. Finally, we discuss ten future research directions to address these gaps.
