Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents

Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Caiming Xiong, Shiva Kumar Pentyala, Chien-Sheng Wu

TL;DR

This work tackles the extraction of structured dialog workflows from historical customer–agent conversations, with the goal of making service AI agents more consistent. It introduces a two-stage framework that first retrieves conversations based on their procedural elements and then applies a question-answer-based chain-of-thought (QA-CoT) prompting strategy to generate comprehensive workflows. Extracted workflows are evaluated end to end by simulating dialogs between an agent bot and a customer bot. On the ABCD and SynthABCD datasets, QA-CoT outperforms the prompting baselines it is compared against, improving average macro accuracy by 12.16% over the baseline, and the simulation-based evaluation aligns closely with human assessments. Together, the retrieval step, the structured reasoning step, and the scalable evaluation give a practical basis for building and validating dialog workflows in automated service settings.

Abstract

Automated service agents require well-structured workflows to provide consistent and accurate responses to customer queries. However, these workflows are often undocumented, and their automatic extraction from conversations remains unexplored. In this work, we present a novel framework for extracting and evaluating dialog workflows from historical interactions. Our extraction process consists of two key stages: (1) a retrieval step that selects relevant conversations based on key procedural elements, and (2) a structured workflow generation step that uses question-answer-based chain-of-thought (QA-CoT) prompting. To comprehensively assess the quality of extracted workflows, we introduce an automated simulation framework, built on agent and customer bots, that measures how effectively the workflows resolve customer issues. Extensive experiments on the ABCD and SynthABCD datasets demonstrate that our QA-CoT technique improves workflow extraction by 12.16% in average macro accuracy over the baseline. Moreover, our evaluation method closely aligns with human assessments, providing a reliable and scalable framework for future research.
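The two extraction stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy coverage heuristic in `select_conversations`, the prompt wording in `build_qa_cot_prompt`, and all names and sample data are assumptions made for illustration. The sketch also assumes that the procedural elements of each conversation have already been extracted as sets of short strings.

```python
from typing import Dict, List, Set


def select_conversations(elements: Dict[str, Set[str]], k: int) -> List[str]:
    """Greedily pick up to k conversation ids, each adding the most unseen
    procedural elements. The greedy coverage rule is an assumption, not
    necessarily the retriever used in the paper."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(elements)
    while remaining and len(chosen) < k:
        # Iterate ids in sorted order so ties break deterministically.
        best = max(sorted(remaining), key=lambda cid: len(remaining[cid] - covered))
        if not remaining[best] - covered:
            break  # no remaining conversation contributes a new element
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


def build_qa_cot_prompt(intent: str, conversations: List[str]) -> str:
    """Assemble a hypothetical QA-CoT prompt: ask for question-answer pairs
    about the procedure first, then for the workflow itself."""
    joined = "\n\n".join(
        f"Conversation {i + 1}:\n{text}" for i, text in enumerate(conversations)
    )
    return (
        f"Intent: {intent}\n\n{joined}\n\n"
        "Step 1: Write question-answer pairs about the preconditions, decision "
        "points, and agent actions seen in these conversations.\n"
        "Step 2: Using those pairs, write the complete step-by-step workflow."
    )


# Hypothetical procedural elements for three conversations.
sample_elements = {
    "c1": {"verify identity", "check membership level"},
    "c2": {"verify identity"},
    "c3": {"check purchase date", "issue refund"},
}
selected = select_conversations(sample_elements, k=2)
prompt = build_qa_cot_prompt("return_color", ["<text of c1>", "<text of c3>"])
```

Here `selected` contains `c1` and `c3`, because `c2` adds no procedural element that `c1` has not already covered. The resulting prompt string would then be sent to an LLM of choice; that call is deliberately left out of the sketch.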

Paper Structure

This paper contains 31 sections, 3 equations, 26 figures, 7 tables.

Figures (26)

  • Figure 1: An example showing the derivation of a workflow from historical conversations. The full workflow is shown in Figs. \ref{fig:retriever_1_2} and \ref{fig:eval_1_2}.
  • Figure 2: An example of procedural elements extracted from a conversation by the GPT-4o mini LLM.
  • Figure 3: (a) An example workflow with procedural elements for 3 of the 10 possible distinct sub-flows. Sub-flows and their procedural elements are color-coded to match. (b) Steps in our proposed procedural element-based retriever. An example of the complete procedural elements extracted from a conversation is shown in Fig. \ref{procedural-elements}.
  • Figure 4: A snippet of the QA chain-of-thought generated by the GPT-4o model for the return_color intent using conversations from the ABCD data. The extracted QA pairs highlight key preconditions based on membership level.
  • Figure 5: (a) Flowchart of the workflow in Fig. \ref{fig:retriever1}, illustrating 10 possible customer scenarios [Step 1, E2E pipeline], along with an example of user information, system information, and the success criteria for one scenario [Steps 2 and 3]. (b) Simulation of the user (system) bot based on intent and user information (system information and the predicted workflow) [Step 4], followed by a final evaluation of dialogue success [Step 5].
  • ...and 21 more figures
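
The final scoring step of the end-to-end evaluation summarized in the Figure 5 caption can be sketched as follows. This is a hedged illustration only. It assumes that "macro accuracy" means the unweighted mean of per-intent dialogue success rates, which the text here does not spell out. The function name, the `refund_status` intent, and the outcome data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def macro_accuracy(outcomes: Iterable[Tuple[str, bool]]) -> float:
    """Average the per-intent success rates of simulated dialogs.

    Each outcome is (intent, success), where success records whether one
    simulated scenario met its success criteria. Every intent counts equally,
    regardless of how many scenarios it has (assumed definition).
    """
    per_intent: Dict[str, List[bool]] = defaultdict(list)
    for intent, success in outcomes:
        per_intent[intent].append(success)
    if not per_intent:
        raise ValueError("no outcomes to score")
    rates = [sum(flags) / len(flags) for flags in per_intent.values()]
    return sum(rates) / len(rates)


# Hypothetical simulation outcomes: two scenarios for one intent, one for another.
sample_outcomes = [
    ("return_color", True),
    ("return_color", False),
    ("refund_status", True),
]
score = macro_accuracy(sample_outcomes)  # (0.5 + 1.0) / 2 = 0.75
```

Averaging per intent rather than per dialog keeps intents with many scenarios from dominating the score. Under a micro-averaged definition, the same data would give 2/3 instead.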