Does Collaborative Human-LM Dialogue Generation Help Information Extraction from Human Dialogues?
Bo-Ru Lu, Nikita Haduong, Chia-Hsuan Lee, Zeqiu Wu, Hao Cheng, Paul Koester, Jean Utke, Tao Yu, Noah A. Smith, Mari Ostendorf
TL;DR
This work tackles the privacy barrier to sharing human-human dialogues by proposing DialGen, a human-in-the-loop dialogue generation framework that synthesizes long, complex call-center conversations for information extraction (IE). DialGen pairs a language model with human reviewers who generate, edit, and annotate synthetic dialogues guided by an ontology, enabling controlled coverage of diverse entity-slot-value information. The authors introduce an entity-centric IE scoring scheme and show that synthetic data, when combined with real conversations, significantly improves $F_1$ on a private auto-insurance IE task, with notable gains in recall and slot-value accuracy. The approach offers a practical way to build rich, privacy-preserving dialogue datasets and to improve information extraction in privacy-constrained domains.
Abstract
The capabilities of pretrained language models have opened opportunities to explore new application areas, but applications involving human-human interaction are limited because most data is protected from public release for privacy reasons. Problem-solving human dialogues in real applications can be much more complex than existing Wizard-of-Oz collections, preventing successful domain transfer. To support information extraction (IE) for a private call center dataset, we introduce a human-in-the-loop dialogue generation framework capable of synthesizing realistic dialogues. In IE experiments with auto insurance call center dialogues, we observe a 25\% relative improvement in $F_1$ after augmenting a small set of real human conversations with synthetic data. We release code and our synthetic dataset to illustrate the complexity of real-world call center conversations and to encourage the development of complex dialogue datasets that are more representative of natural data.
