Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi

TL;DR

ARIA is a general framework for test-time learning in LLM agents that couples structured self-reflection with human-in-the-loop guidance to maintain an evolving, timestamped knowledge base. The Intelligent Guidance Solicitation (IGS) module detects uncertainty through self-dialogue and formulates targeted queries, while the Human-Guided Knowledge Adaptation (HGKA) module updates the knowledge base and resolves conflicts, supported by a temporally-informed retrieval mechanism. Empirical results on TikTok Pay customer due diligence (CDD) name screening and the CUAD legal dataset show that ARIA outperforms offline fine-tuning, RAG, and various self-improvement baselines in accuracy and efficiency, including substantial reductions in average handling time. The work demonstrates practical viability in rapidly changing environments and highlights considerations around expert availability, knowledge-base complexity, and ethical deployment. Overall, ARIA offers a principled way to sustain high-performing, domain-sensitive AI agents in production by integrating real-time human expertise with disciplined knowledge management.

Abstract

Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
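
The timestamped knowledge repository described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names, the `topic` key, and the "newest guidance wins" policy are assumptions made for illustration only, whereas the abstract says ARIA resolves conflicts through comparisons and clarification queries to experts. The example guidance strings are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Rule:
    """One piece of expert guidance (hypothetical schema, not from the paper)."""
    topic: str            # assumed key identifying what the guidance is about
    guidance: str         # text supplied by the human expert
    timestamp: datetime   # when the guidance was received
    active: bool = True   # False once superseded by newer guidance


@dataclass
class KnowledgeRepository:
    rules: List[Rule] = field(default_factory=list)

    def add(self, topic: str, guidance: str, timestamp: datetime) -> List[Rule]:
        """Store new guidance and return active rules on the same topic that disagree.

        Assumed policy: conflicting rules that are not newer than the incoming
        guidance are retired. A fuller system could instead ask the expert
        which rule should hold.
        """
        conflicts = [
            r for r in self.rules
            if r.active and r.topic == topic and r.guidance != guidance
        ]
        for old in conflicts:
            if old.timestamp <= timestamp:
                old.active = False
        self.rules.append(Rule(topic, guidance, timestamp))
        return conflicts

    def retrieve(self, topic: str) -> Optional[Rule]:
        """Return the most recent active rule for a topic, or None if absent."""
        candidates = [r for r in self.rules if r.active and r.topic == topic]
        return max(candidates, key=lambda r: r.timestamp, default=None)


if __name__ == "__main__":
    repo = KnowledgeRepository()
    repo.add("alias_matching", "Treat nicknames as mismatches", datetime(2024, 1, 1))
    flagged = repo.add(
        "alias_matching", "Treat nicknames as potential matches", datetime(2024, 6, 1)
    )
    print(len(flagged), repo.retrieve("alias_matching").guidance)
```

Returning the conflicting rules from `add` rather than silently overwriting them leaves room for the caller to raise a clarification query, which is closer in spirit to the behavior the abstract describes.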

Paper Structure

This paper contains 41 sections, 3 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Overview of the ARIA framework. The agent processes input, assesses the need for guidance via self-reflection, and can solicit human expert feedback. This feedback is integrated into an evolving knowledge repository, enabling learning at test time.
  • Figure 2: Illustrative example of the Intelligent Guidance Solicitation (IGS) process.
  • Figure 3: Illustrative example of the Conflict Detection and Resolution process within HGKA.
  • Figure 4: Illustrative example of the Active Clarification query generation process within HGKA.
  • Figure 5: Part 1: Illustrative example of ARIA's review process for a CDD case involving Malay name structure and DOB discrepancy.
  • ...and 8 more figures