Get Experience from Practice: LLM Agents with Record & Replay

Erhu Feng, Wenbo Zhou, Zibin Liu, Le Chen, Yunpeng Dong, Cheng Zhang, Yisheng Zhao, Dong Du, Zhichao Hua, Yubin Xia, Haibo Chen

TL;DR

This work tackles four major drawbacks of contemporary LLM-driven agents—unreliability due to hallucinations, privacy concerns, high operational costs, and slow execution. It introduces AgentRR, a record-and-replay framework that records interactions, abstracts them into multi-level experiences, and replays them to guide future tasks, anchored by check functions that enforce safety. Key contributions include the multi-level experience design, a formal check-function mechanism as a Trusted Computing Base, and an Experience Store to enable knowledge sharing. A form-filling case study demonstrates improved speed and reliability over existing approaches, while the discussion outlines limitations and directions for refining summarization, state representations, and safety guarantees.

Abstract

AI agents, empowered by Large Language Models (LLMs) and communication protocols such as MCP and A2A, have rapidly evolved from simple chatbots to autonomous entities capable of executing complex, multi-step tasks, demonstrating great potential. However, LLMs' inherent uncertainty and heavy computational resource requirements pose four significant challenges to the development of safe and efficient agents: reliability, privacy, cost, and performance. Existing approaches, like model alignment, workflow constraints, and on-device model deployment, can partially alleviate some issues but often with limitations, failing to fundamentally resolve these challenges. This paper proposes a new paradigm called AgentRR (Agent Record & Replay), which introduces the classical record-and-replay mechanism into AI agent frameworks. The core idea is to (1) record an agent's interaction trace with its environment and internal decision process during task execution, (2) summarize this trace into a structured "experience" encapsulating the workflow and constraints, and (3) replay these experiences in subsequent similar tasks to guide the agent's behavior. We detail a multi-level experience abstraction method and a check function mechanism in AgentRR: the former balances experience specificity and generality, while the latter serves as a trust anchor to ensure completeness and safety during replay. In addition, we explore multiple application modes of AgentRR, including user-recorded task demonstration, large-small model collaboration, and privacy-aware agent execution, and envision an experience repository for sharing and reusing knowledge to further reduce deployment cost.
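
To make the record, summarize, and replay steps from the abstract concrete, the following is a minimal Python sketch, not AgentRR's actual implementation. It assumes a toy environment in which the state is a dictionary of form fields. All names here (`Action`, `Step`, `summarize`, `replay`, the `allowed_targets` whitelist) are hypothetical and chosen only for illustration. The check function is modeled as a predicate that must hold before each replayed action runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

State = Dict[str, str]  # toy environment (assumption): form field -> current value


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "type"
    target: str  # e.g. a form field name
    value: str


@dataclass(frozen=True)
class Step:
    kind: str
    target: str
    slot: str                               # task parameter bound at replay time
    check: Callable[[State, Action], bool]  # must hold before the action executes


def summarize(trace: List[Action], params: Dict[str, str],
              allowed_targets: Set[str]) -> List[Step]:
    """Generalize a recorded trace by replacing concrete values with parameter slots."""
    value_to_slot = {value: name for name, value in params.items()}

    def check(state: State, action: Action) -> bool:
        # Hypothetical safety rule: only whitelisted fields may be touched.
        return action.target in allowed_targets

    return [Step(a.kind, a.target, value_to_slot[a.value], check) for a in trace]


def replay(steps: List[Step], params: Dict[str, str], state: State) -> State:
    """Re-run the experience with new parameters, refusing any step whose check fails."""
    for step in steps:
        action = Action(step.kind, step.target, params[step.slot])
        if not step.check(state, action):
            raise PermissionError(f"check failed for {action}")
        state[action.target] = action.value
    return state


# Record: a trace captured while one task instance was carried out.
recorded = [Action("type", "name", "Alice"),
            Action("type", "email", "alice@example.com")]

# Summarize: turn the trace into a reusable experience.
experience = summarize(recorded,
                       {"name": "Alice", "email": "alice@example.com"},
                       allowed_targets={"name", "email"})

# Replay: apply the experience to a similar task with different inputs.
result = replay(experience, {"name": "Bob", "email": "bob@example.com"}, {})
```

In this sketch the check function is the only component that has to be trusted at replay time: a step that targets a field outside the whitelist is rejected regardless of how the step was produced.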

Paper Structure

This paper contains 19 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Conceptual comparison of human, R&R tools, LLM agents, and AgentRR in task execution.
  • Figure 2: Multi-level experience: High-level experience describes the task planning process without being bound to specific platforms or UI layouts. Low-level experience contains more detailed action decomposition and may be coupled with specific platforms and UI layouts.
  • Figure 3: The overall architecture of AgentRR: The AgentRR system consists of three core components: the Record module, Summary module, and Replay module. Additionally, to facilitate experience sharing across different users, AgentRR incorporates an experience store.
  • Figure 4: The online form filling example: During the record phase, users capture multiple trace behaviors. In the summary phase, these traces are synthesized into multi-level experiences based on both scripts and natural language descriptions. During replay, the system selects the optimal experience and, in conjunction with the specific task requirements, controls web interactions to complete the form.
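
The captions for Figures 2 and 4 describe experiences at several abstraction levels and a replay phase that selects the optimal one. The sketch below illustrates one possible selection policy under stated assumptions: it prefers the most specific experience whose platform matches the current environment and falls back to a platform-independent one otherwise. The `Experience` fields, the numeric `level` convention, and `select_experience` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Experience:
    task: str
    level: int               # assumption: 0 = most concrete, larger = more abstract
    platform: Optional[str]  # None means not bound to a platform or UI layout
    steps: Tuple[str, ...]


def select_experience(store: List[Experience], task: str,
                      platform: str) -> Optional[Experience]:
    """Return the most specific experience applicable to this task and platform."""
    candidates = [e for e in store
                  if e.task == task and e.platform in (None, platform)]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.level)


store = [
    Experience("fill_form", 0, "chrome-desktop",
               ("click #name", "type {name}", "click #submit")),
    Experience("fill_form", 1, None,
               ("locate the name field", "enter the name", "submit the form")),
]

on_desktop = select_experience(store, "fill_form", "chrome-desktop")
on_mobile = select_experience(store, "fill_form", "android")
unknown = select_experience(store, "book_flight", "android")
```

Here the desktop request receives the low-level, platform-bound experience, while the mobile request falls back to the high-level description, which would require more work from the agent during replay.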