Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models

Anjali R. Menon, Rohit K. Sharma, Priya Singh, Chengyu Wang, Aurora M. Ferreira, Mateja Novak

TL;DR

This work addresses the brittleness of open-loop LLM-based robotics by proposing Think-Act-Learn (T-A-L), a closed-loop framework where an embodied agent reasons with an LLM, executes actions via a structured GUI representation, and learns from rich multimodal feedback. GUI-Learner decomposes UI understanding and action generation into a Perception module and a Transformer-based Decision module, trained with a two-phase Hybrid Learning strategy: Behavioral Cloning for warm-start followed by offline reinforcement learning (IQL) from self-generated data. The approach yields superior performance over open-loop LLMs, BC, and end-to-end baselines across web and desktop tasks, with strong generalization to unseen applications and robust failure recovery demonstrated through ablations and qualitative case studies. The results suggest that combining structured UI groundings with offline self-improvement enables efficient, robust, and generalist UI automation, advancing toward autonomous agents capable of operating across diverse software platforms in real-world settings.

Abstract

The integration of Large Language Models (LLMs) into robotics has unlocked unprecedented capabilities in high-level task planning. However, most current systems operate in an open-loop fashion, where LLMs act as one-shot planners, rendering them brittle and unable to adapt to unforeseen circumstances in dynamic physical environments. To overcome this limitation, this paper introduces the "Think, Act, Learn" (T-A-L) framework, a novel architecture that enables an embodied agent to autonomously learn and refine its policies through continuous interaction. Our framework establishes a closed-loop cycle where an LLM first "thinks" by decomposing high-level commands into actionable plans. The robot then "acts" by executing these plans while gathering rich, multimodal sensory feedback. Critically, the "learn" module processes this feedback to facilitate LLM-driven self-reflection, allowing the agent to perform causal analysis on its failures and generate corrective strategies. These insights are stored in an experiential memory to guide future planning cycles. We demonstrate through extensive experiments in both simulation and the real world that our T-A-L agent significantly outperforms baseline methods, including open-loop LLMs, Behavioral Cloning, and traditional Reinforcement Learning. Our framework achieves over a 97% success rate on complex, long-horizon tasks, converges to a stable policy in an average of just 9 trials, and exhibits remarkable generalization to unseen tasks. This work presents a significant step towards developing more robust, adaptive, and truly autonomous robotic agents.
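The closed-loop cycle described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `ExperientialMemory`, `think_act_learn`, `toy_think`, `toy_act`, and `toy_learn` are hypothetical, the "LLM" is replaced by hand-written stub functions, and the multimodal feedback is reduced to a single string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ExperientialMemory:
    """Hypothetical store of corrective insights gathered from failures."""
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        # Skip duplicates so repeated failures do not bloat the memory.
        if insight not in self.insights:
            self.insights.append(insight)


# Assumed interfaces; a real system would back these with an LLM and a robot.
Planner = Callable[[str, List[str]], List[str]]   # (task, insights) -> plan
Executor = Callable[[List[str]], Tuple[bool, str]]  # plan -> (success, feedback)
Reflector = Callable[[str, List[str], str], str]  # (task, plan, feedback) -> insight


def think_act_learn(task: str, think: Planner, act: Executor, learn: Reflector,
                    memory: ExperientialMemory, max_trials: int = 5) -> int:
    """Run the loop; return the 1-based trial that succeeded, or -1."""
    for trial in range(1, max_trials + 1):
        plan = think(task, memory.insights)      # "think": plan using past insights
        success, feedback = act(plan)            # "act": execute and observe
        if success:
            return trial
        memory.add(learn(task, plan, feedback))  # "learn": reflect on the failure
    return -1


# Toy stand-ins so the sketch runs without an LLM or a robot.
def toy_think(task: str, insights: List[str]) -> List[str]:
    plan = ["reach", "grasp", "lift"]
    if "open gripper before grasp" in insights:
        plan.insert(1, "open_gripper")
    return plan


def toy_act(plan: List[str]) -> Tuple[bool, str]:
    if "open_gripper" not in plan:
        return False, "grasp failed: gripper closed"
    return True, "ok"


def toy_learn(task: str, plan: List[str], feedback: str) -> str:
    return "open gripper before grasp"
```

With these stubs, the first trial fails, the reflection step stores one insight, and the second trial succeeds because the planner reads that insight from memory. An open-loop planner would repeat the same failing plan on every trial.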

Paper Structure

This paper contains 43 sections, 8 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of the proposed GUI-Learner framework. At each step, the Perception Module takes the raw screen pixels and the user's instruction, processing them into a structured UI representation. This representation, which includes identified elements and their properties, is then fed to the Transformer-based Decision Module. The Decision Module selects the next action to execute. The entire agent is trained end-to-end using a hybrid strategy that combines behavioral cloning from expert data and offline reinforcement learning from self-generated interaction data.
  • Figure 2: Learning curve illustrating the two-phase hybrid strategy. Behavioral Cloning quickly warms up the policy, while Offline RL further pushes performance.
  • Figure 3: Main results on Web and Desktop environments: Task Success Rate (SR). GUI-Learner significantly outperforms all baselines.
  • Figure 4: Action Efficiency (AE) comparison on Web and Desktop environments. Our agent reaches near-human efficiency.
  • Figure 5: Ablation study on the Desktop benchmark (Success Rate). Removing either the offline RL phase, the structured UI representation, or expert pre-training causes a large drop in performance.
  • ...and 2 more figures
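
The captions for Figures 1 and 2 name offline reinforcement learning as the second training phase, and the TL;DR identifies the algorithm as IQL. As general background, IQL fits its value function with an expectile regression loss, which weights positive and negative residuals asymmetrically. The sketch below shows only that loss in plain Python. The function name `expectile_loss` and the default `tau=0.7` are illustrative assumptions, not values taken from this paper.

```python
from typing import Sequence


def expectile_loss(residuals: Sequence[float], tau: float = 0.7) -> float:
    """Mean of |tau - 1(u < 0)| * u**2 over the residuals u.

    In IQL the residual is typically u = Q(s, a) - V(s); tau > 0.5 penalises
    underestimating V more than overestimating it.
    """
    if not residuals:
        raise ValueError("residuals must be non-empty")
    total = 0.0
    for u in residuals:
        weight = tau if u >= 0 else 1.0 - tau  # asymmetric weighting
        total += weight * u * u
    return total / len(residuals)
```

Setting `tau=0.5` reduces this to half the ordinary mean squared error. Larger values of `tau` push the fitted value toward the upper end of the action-value distribution in the dataset.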