Dialog2Flow: Pre-training Soft-Contrastive Action-Driven Sentence Embeddings for Automatic Dialog Flow Extraction

Sergio Burdisso, Srikanth Madikeri, Petr Motlicek

Abstract

Efficiently deriving structured workflows from unannotated dialogs remains an underexplored and formidable challenge in computational linguistics. Automating this process could significantly accelerate the manual design of workflows in new domains and enable the grounding of large language models in domain-specific flowcharts, enhancing transparency and controllability. In this paper, we introduce Dialog2Flow (D2F) embeddings, which differ from conventional sentence embeddings by mapping utterances to a latent space where they are grouped according to their communicative and informative functions (i.e., the actions they represent). D2F allows for modeling dialogs as continuous trajectories in a latent space with distinct action-related regions. By clustering D2F embeddings, the latent space is quantized, and dialogs can be converted into sequences of region/action IDs, facilitating the extraction of the underlying workflow. To pre-train D2F, we build a comprehensive dataset by unifying twenty task-oriented dialog datasets with normalized per-turn action annotations. We also introduce a novel soft contrastive loss that leverages the semantic information of these actions to guide the representation learning process, showing superior performance compared to standard supervised contrastive loss. Evaluation against various sentence embeddings, including dialog-specific ones, demonstrates that D2F yields superior qualitative and quantitative results across diverse domains.
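The quantization step described above (cluster the embeddings, replace each utterance with its region/action ID, then read off the flow) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D vectors stand in for D2F embeddings, the centroids are assumed to come from a prior clustering step, and the helper names `cosine`, `quantize`, and `transition_counts` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def quantize(dialog, centroids):
    """Map each utterance embedding to the ID of its most similar centroid."""
    return [
        max(range(len(centroids)), key=lambda k: cosine(emb, centroids[k]))
        for emb in dialog
    ]


def transition_counts(id_sequences):
    """Count how often one region/action ID directly follows another."""
    counts = Counter()
    for seq in id_sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


# Assumed inputs: three cluster centroids and two toy dialogs,
# each a list of per-utterance embeddings.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
dialogs = [
    [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1]],
    [[1.0, 0.2], [0.1, 0.9], [0.0, 1.0], [-0.9, -0.1]],
]

sequences = [quantize(d, centroids) for d in dialogs]
graph = transition_counts(sequences)

for (src, dst), freq in sorted(graph.items()):
    print(f"{src} -> {dst}: {freq}")
```

The weighted edges in `graph` play the role of a directed workflow graph in which edge frequency reflects how common each transition is. A real pipeline would obtain the centroids from a clustering algorithm run over all utterance embeddings, and would likely prune infrequent edges.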

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Example segment of the dialog SNG1533 from the hospital domain of the SpokenWOZ dataset. Actions are defined by concatenating the dialog act label (in bold) with the slot label(s) associated with each utterance.
  • Figure 2: Directed graph representing the hospital domain workflow obtained from all the hospital dialogs in the SpokenWOZ dataset. Nodes correspond to individual actions. The width of edges and the underline thickness of nodes indicate their frequency. User actions are colored to distinguish them from system actions.
  • Figure 3: Spherical Voronoi diagram of embeddings projected onto the unit sphere using UMAP with cosine distance as the metric. The embeddings represent system utterances from the hotel domain of the MultiWOZ2.1 dataset. Legends indicate the ground-truth action associated with each embedding and the centroids used to generate the partitions for all the actions in this domain.
  • Figure 4: $\hat{G}_{hospital}$ graph obtained with D2F$_{joint}$, containing only one node fewer than the reference graph in Figure 2. Node labels correspond to the cluster ID along with a representative utterance (the one closest to the cluster centroid). Although it is not identical to the reference, this graph still conveys the common flow of the conversations with a similar degree of detail: first, the user and system greet each other (U0 and S6); then the user states the reason for the call by requesting the phone number of a department (U4); the agent may confirm the department (S7) or request more information (S4) before providing the phone number (S2). The user may then either confirm the number (U3) or thank the system (U5). Finally, the system asks if anything else is required (S5), to which the user may either end the conversation (U6) or, more likely, thank the system (U2) before the system says goodbye (S0).
  • Figure A1: $\hat{G}_{hospital}$ graph obtained with Sentence-BERT (8 nodes/actions in total). Node labels correspond to the cluster ID along with a representative utterance (the one closest to the cluster centroid).
  • ...and 4 more figures