Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models
Anjali R. Menon, Rohit K. Sharma, Priya Singh, Chengyu Wang, Aurora M. Ferreira, Mateja Novak
TL;DR
This work addresses the brittleness of open-loop LLM-based robotics by proposing Think, Act, Learn (T-A-L), a closed-loop framework in which an embodied agent plans with an LLM, executes the plan while gathering rich multimodal sensory feedback, and learns through LLM-driven self-reflection on its failures. The corrective strategies produced by that reflection are stored in an experiential memory that guides later planning cycles. In simulation and real-world experiments, the T-A-L agent outperforms open-loop LLM, Behavioral Cloning, and traditional Reinforcement Learning baselines: it achieves a success rate above 97% on complex, long-horizon tasks, converges to a stable policy in an average of 9 trials, and generalizes to unseen tasks. The results suggest that closing the loop between planning, execution, and reflection is a practical route toward more robust and adaptive robotic agents.
Abstract
The integration of Large Language Models (LLMs) into robotics has unlocked unprecedented capabilities in high-level task planning. However, most current systems operate in an open-loop fashion, where LLMs act as one-shot planners, rendering them brittle and unable to adapt to unforeseen circumstances in dynamic physical environments. To overcome this limitation, this paper introduces the "Think, Act, Learn" (T-A-L) framework, a novel architecture that enables an embodied agent to autonomously learn and refine its policies through continuous interaction. Our framework establishes a closed-loop cycle where an LLM first "thinks" by decomposing high-level commands into actionable plans. The robot then "acts" by executing these plans while gathering rich, multimodal sensory feedback. Critically, the "learn" module processes this feedback to facilitate LLM-driven self-reflection, allowing the agent to perform causal analysis on its failures and generate corrective strategies. These insights are stored in an experiential memory to guide future planning cycles. We demonstrate through extensive experiments in both simulation and the real world that our T-A-L agent significantly outperforms baseline methods, including open-loop LLMs, Behavioral Cloning, and traditional Reinforcement Learning. Our framework achieves over a 97% success rate on complex, long-horizon tasks, converges to a stable policy in an average of just 9 trials, and exhibits remarkable generalization to unseen tasks. This work presents a significant step towards developing more robust, adaptive, and truly autonomous robotic agents.
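To make the closed-loop cycle concrete, the following is a minimal toy sketch of the think, act, learn sequence described above. It is not the authors' implementation: the LLM planner, the robot executor, and the reflection step are replaced by hard-coded stand-ins, and every name (`ExperienceMemory`, `think`, `act`, `learn`, `run_tal`) as well as the cup-and-gripper scenario is a hypothetical chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ExperienceMemory:
    """Hypothetical store for corrective insights produced by reflection."""
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        # Keep each insight only once.
        if insight not in self.insights:
            self.insights.append(insight)


def think(command: str, memory: ExperienceMemory) -> List[str]:
    """Stand-in for the LLM planner: decompose a command into steps,
    conditioned on insights stored from earlier trials."""
    plan = ["locate cup", "grasp cup", "place cup on shelf"]
    if "open gripper before grasping" in memory.insights:
        plan.insert(1, "open gripper")
    return plan


def act(plan: List[str]) -> Tuple[bool, str]:
    """Stand-in for execution: returns a success flag and a text summary
    of the (simulated) sensory feedback."""
    if "grasp cup" in plan and "open gripper" not in plan:
        return False, "grasp failed: gripper was closed on contact"
    return True, "all steps completed"


def learn(feedback: str) -> str:
    """Stand-in for LLM self-reflection: map failure feedback to a
    corrective strategy."""
    if "gripper was closed" in feedback:
        return "open gripper before grasping"
    return "no correction found"


def run_tal(command: str, max_trials: int = 5):
    """Run the closed loop until a trial succeeds or trials run out.
    Returns (trial number or None, last plan, memory)."""
    memory = ExperienceMemory()
    plan: List[str] = []
    trial_of_success: Optional[int] = None
    for trial in range(1, max_trials + 1):
        plan = think(command, memory)      # think
        success, feedback = act(plan)      # act
        if success:
            trial_of_success = trial
            break
        memory.add(learn(feedback))        # learn
    return trial_of_success, plan, memory


if __name__ == "__main__":
    trial, plan, memory = run_tal("put the cup on the shelf")
    print(trial, plan, memory.insights)
```

In this toy setting the first trial fails, the reflection step records one insight, and the second plan incorporates it. An open-loop planner corresponds to calling `think` and `act` once with an empty memory, which is why it cannot recover from the same failure.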
