From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
Krittin Pachtrachai, Petmongkon Pornpichitsuwan, Wachiravit Modecrua, Touchapon Kraisingkorn
TL;DR
The paper tackles the challenge of building reliable customer-facing AI assistants from noisy historical call transcripts by proposing a transcript-driven framework that combines grading, filtering, and retrieval-augmented grounding. Knowledge is extracted from curated transcripts and stored externally, which enables RAG grounding while keeping prompts lean. Assistant behavior is controlled through a spectrum of prompt architectures ranging from Early Escalation to Governed Execution. Evaluation uses a transcript-grounded user simulator and red teaming to measure coverage, factual accuracy, rejection behavior, and robustness. The results show that the approach can autonomously handle a meaningful portion of calls (approximately 30% in challenging domains) with high factual grounding and strong robustness under adversarial testing. The findings indicate that structured, governance-oriented prompt design matters more than domain-specific tailoring for cross-domain generalization and safe production deployment in real-world call centers.
Abstract
Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off, particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percent of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.
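As a rough illustration of two steps named in the abstract, the sketch below shows how graded transcripts might be filtered and how call coverage might be computed from simulated calls. It is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the score fields, the [0, 1] score range, the 0.8 threshold, and the definition of coverage as the fraction of simulated calls completed without human escalation are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GradedTranscript:
    """A transcript with two grading scores (hypothetical fields, assumed in [0, 1])."""
    transcript_id: str
    observation_alignment: float
    response_behavior: float


@dataclass
class SimulatedCall:
    """Outcome of one simulated call (hypothetical structure)."""
    call_id: str
    handled_autonomously: bool


def filter_transcripts(graded, threshold=0.8):
    """Keep transcripts whose two scores both meet the (assumed) threshold."""
    return [
        t for t in graded
        if min(t.observation_alignment, t.response_behavior) >= threshold
    ]


def call_coverage(calls):
    """Fraction of simulated calls handled without human escalation."""
    if not calls:
        return 0.0
    return sum(c.handled_autonomously for c in calls) / len(calls)


# Illustrative data only; these values are invented for the example.
graded = [
    GradedTranscript("t1", 0.95, 0.90),
    GradedTranscript("t2", 0.60, 0.92),  # dropped: weak observation alignment
    GradedTranscript("t3", 0.85, 0.81),
]
kept = filter_transcripts(graded)

calls = [SimulatedCall(f"c{i}", handled_autonomously=(i < 3)) for i in range(10)]
coverage = call_coverage(calls)
```

Taking the minimum of the two scores means a transcript is retained only when it is acceptable on both grading dimensions; a weighted average would be an equally plausible choice if one dimension were considered more important.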
