A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation

Yongkang Liu, Ercong Nie, Shi Feng, Zheng Hua, Zifeng Ding, Daling Wang, Yifei Zhang, Hinrich Schütze

TL;DR

The paper tackles the challenge of building high-quality dialogue systems in low-resource, multi-domain settings by introducing AMD^2G, a data augmentation framework that uses de-domaining and a two-stage training procedure. Domain dictionaries, built from LLM-driven keyword extraction and lexical banks, enable replacement of domain-specific terms with placeholders, producing domain-agnostic corpora that reveal shared expression patterns. The framework then performs domain-agnostic training followed by domain adaptation to the target domain, achieving consistent gains over target-domain-only and joint-domain baselines across five Chinese domains and multiple model types. These findings highlight AMD^2G as a practical approach for cross-domain transfer in low-resource scenarios, with public code and data enhancing reproducibility and application potential.

Abstract

Current state-of-the-art dialogue systems rely heavily on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data \textbf{A}ugmentation framework for \textbf{M}ulti-\textbf{D}omain \textbf{D}ialogue \textbf{G}eneration, referred to as \textbf{AMD$^2$G}. The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a \textit{\textbf{de-domaining}} data processing technique that removes domain-specific features. By mitigating the effects of domain-specific features, a model trained on the de-domained corpora can effectively learn expression patterns common to different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD$^2$G achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMD$^2$G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available in a GitHub repository.
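
The de-domaining step and the two-stage data split described above can be illustrated with a minimal sketch. It is not the authors' implementation. The domain dictionaries here are small hand-written lists (the paper builds them from LLM-driven keyword extraction and lexical banks), the examples are in English rather than Chinese, and the function names `build_pattern`, `de_domain`, and `build_training_stages` are hypothetical. The placeholder symbol `$P` follows the Figure 2 caption.

```python
import re

PLACEHOLDER = "$P"  # placeholder symbol, as named in the Figure 2 caption


def build_pattern(domain_terms):
    """Compile one regex that matches any term in a domain dictionary.

    Terms are sorted longest-first so that "New York City" is tried
    before its prefix "New York".
    """
    terms = sorted(set(domain_terms), key=len, reverse=True)
    if not terms:
        raise ValueError("domain dictionary must not be empty")
    return re.compile("|".join(re.escape(t) for t in terms))


def de_domain(utterance, pattern, placeholder=PLACEHOLDER):
    """Replace every domain-specific term with the placeholder."""
    # A callable replacement avoids any escape handling in the placeholder.
    return pattern.sub(lambda _match: placeholder, utterance)


def build_training_stages(source_corpora, dictionaries, target_corpus):
    """Return (stage1, stage2) training data.

    stage1: de-domained utterances pooled from all source domains
            (used for domain-agnostic training).
    stage2: the untouched target-domain corpus
            (used for domain adaptation training).
    """
    stage1 = []
    for domain, utterances in source_corpora.items():
        pattern = build_pattern(dictionaries[domain])
        stage1.extend(de_domain(u, pattern) for u in utterances)
    return stage1, list(target_corpus)


# Illustrative data only; not taken from the paper's datasets.
dictionaries = {
    "film": ["Inception", "Nolan", "Christopher Nolan"],
    "travel": ["New York", "New York City"],
}
source_corpora = {
    "film": ["Have you seen Inception by Christopher Nolan?"],
    "travel": ["I want to visit New York City next month."],
}
target_corpus = ["Does this phone case come in blue?"]

stage1, stage2 = build_training_stages(source_corpora, dictionaries, target_corpus)
for line in stage1:
    print(line)
```

In this sketch a model would first be trained on `stage1` and then fine-tuned on `stage2`. Sorting terms longest-first is a design choice made here so that overlapping dictionary entries collapse into a single placeholder.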

Paper Structure (21 sections, 1 equation, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Illustration of corpus composition in different domains. (a) represents domain-specific corpora, and (b) represents domain-independent corpora. The overlap of Domain A (blue) and Domain B (orange) represents domain-agnostic data, while the non-overlapping regions represent domain-specific data.
  • Figure 2: Schematic diagram of the AMD$^2$G framework. The target domain is E-Commerce, and the domains used for de-domaining are Film, Music, Travel, and Medical. $\$P$ represents the placeholder. The method supports both encoder-decoder and decoder-only architectures.
  • Figure 3: The first five panels show how average performance and PPL change with the training data in each of the five domains. The last panel shows the n-gram similarity scores (i.e., Uni, Bi, Tri, and Quad) and the average performance gain (i.e., DeltaScore) of models trained with AMD$^2$G compared to direct training on the target domain corpus. To highlight the trend, the DeltaScore values are multiplied by 1000.