From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants

Krittin Pachtrachai, Petmongkon Pornpichitsuwan, Wachiravit Modecrua, Touchapon Kraisingkorn

TL;DR

The paper tackles the challenge of building reliable customer-facing AI assistants from noisy historical call transcripts by proposing a transcript-driven framework that combines grading, filtering, and retrieval-augmented grounding. Knowledge is extracted from curated transcripts and stored externally, which enables RAG grounding while keeping prompts lean. Assistant behavior is controlled through a spectrum of prompt architectures ranging from Early Escalation to Governed Execution. Evaluation uses a transcript-grounded user simulator and red teaming to measure coverage, factual accuracy, rejection behavior, and robustness, showing that the approach can autonomously handle a meaningful portion of calls (approximately 30% in challenging domains) with high factual grounding and robust safety. The findings indicate that structured, governance-oriented prompt design matters more than domain-specific tailoring for cross-domain generalization and safe production deployment in real-world call centers.

Abstract

Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off, particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percent of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.
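
The grading-and-filtering step and the coverage measurement described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `GradedTranscript` fields, the score range of 0 to 1, the `threshold` value of 0.8, and the definition of coverage as the share of simulated calls completed without escalation to a human agent.

```python
from dataclasses import dataclass


@dataclass
class GradedTranscript:
    # All fields are hypothetical; the abstract names the two grading
    # criteria but does not specify a scale or a schema.
    transcript_id: str
    observation_alignment: float      # assumed score in [0, 1]
    response_appropriateness: float   # assumed score in [0, 1]


def filter_transcripts(graded, threshold=0.8):
    """Keep transcripts whose two grades both meet the (assumed) threshold."""
    return [
        t for t in graded
        if t.observation_alignment >= threshold
        and t.response_appropriateness >= threshold
    ]


def coverage(handled_without_escalation):
    """Fraction of simulated calls completed without human hand-off.

    `handled_without_escalation` is a list of booleans, one per simulated
    call. This definition of coverage is an assumption.
    """
    if not handled_without_escalation:
        return 0.0
    return sum(handled_without_escalation) / len(handled_without_escalation)


if __name__ == "__main__":
    graded = [
        GradedTranscript("a", 0.90, 0.95),
        GradedTranscript("b", 0.90, 0.50),
        GradedTranscript("c", 0.80, 0.80),
    ]
    kept = filter_transcripts(graded)
    print([t.transcript_id for t in kept])
    print(coverage([True, False, False, False]))
```

Requiring both grades to pass, rather than averaging them, reflects the abstract's wording that retained interactions show both coherent flow and effective agent responses; a weighted score would be an equally plausible choice.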

Paper Structure (24 sections, 3 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Overview of the proposed transcript-to-agent pipeline. The framework illustrates transcript grading and filtering, strategic selection for knowledge coverage, knowledge extraction, prompt tuning, RAG-based response generation, and human agent transfer when escalation is required.
  • Figure 2: Performance of different prompt orchestration strategies across two industries. Prompt variants include Early Escalation with knowledge embedded in the system prompt, Low-autonomy RAG with externalized knowledge, Modular orchestration with shared step-wise actions, Programmatic orchestration using pseudo-code actions, YAML-based structured control, Protocol-based orchestration with highly compact shared behavior, and Governed execution with constrained modular actions. Coverage, factual accuracy, and rejection accuracy are reported for (a) Real Estate and (b) Specialist Recruitment.