A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation

Yongkang Liu; Ercong Nie; Shi Feng; Zheng Hua; Zifeng Ding; Daling Wang; Yifei Zhang; Hinrich Schütze

TL;DR

The paper addresses dialogue generation in low-resource, multi-domain settings by introducing AMD$^2$G, a data augmentation framework that combines de-domaining with a two-stage training procedure. Domain dictionaries, built from LLM-driven keyword extraction and lexical banks, are used to replace domain-specific terms with placeholders, producing domain-agnostic corpora that expose expression patterns shared across domains. The model is first trained on these domain-agnostic corpora and then adapted to the target domain, achieving consistent gains over target-domain-only and joint-domain baselines across five domains of Chinese dialogue data and multiple model types. The results position AMD$^2$G as a practical approach to cross-domain transfer in low-resource scenarios, and the code and data are publicly available.

Abstract

Current state-of-the-art dialogue systems rely heavily on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data \textbf{A}ugmentation framework for \textbf{M}ulti-\textbf{D}omain \textbf{D}ialogue \textbf{G}eneration, referred to as \textbf{AMD$^2$G}. The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expression patterns. To construct domain-agnostic dialogue corpora, we employ a \textit{\textbf{de-domaining}} data processing technique that removes domain-specific features. Because the effects of domain-specific features are mitigated, a model trained on the de-domained corpora can effectively learn the expression patterns common to different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD$^2$G achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMD$^2$G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available in a GitHub repository.
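To make the de-domaining step concrete, the following is a minimal Python sketch of dictionary-based placeholder replacement. It is an illustration under stated assumptions, not the authors' released implementation: the function name `de_domain`, the plain substring matching strategy, and the toy dictionary entries are all hypothetical, and only the `$P` placeholder symbol is taken from the Figure 2 caption.

```python
import re

# "$P" is the placeholder symbol named in the Figure 2 caption.
PLACEHOLDER = "$P"


def de_domain(utterance, domain_terms, placeholder=PLACEHOLDER):
    """Replace every dictionary term found in `utterance` with `placeholder`.

    `domain_terms` stands in for a domain dictionary. How the paper builds
    and matches its dictionaries is not specified here, so plain substring
    matching is an assumption of this sketch.
    """
    if not domain_terms:
        return utterance
    # Try longer terms first so a multi-word term wins over its substrings.
    ordered = sorted(domain_terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    # A function replacement avoids any escape processing of the placeholder.
    return pattern.sub(lambda match: placeholder, utterance)


# Hypothetical toy dictionaries for two source domains.
film_terms = {"Inception"}
music_terms = {"rock", "rock music"}

print(de_domain("I want to watch Inception tonight", film_terms))
# I want to watch $P tonight
print(de_domain("I like rock music", music_terms))
# I like $P
```

Applying such a function to every utterance of the non-target domains would yield the domain-agnostic corpus used in the first training stage, with the original target-domain corpus reserved for the adaptation stage.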

Paper Structure (21 sections, 1 equation, 3 figures, 6 tables)

Introduction
Methodology
Problem Formulation
De-Domaining Data Processing
De-Domaining
Dictionary Construction
Domain-Agnostic Training and Domain Adaptation
Domain Similarity
Experiments
Datasets
Models
Baselines
Implementation Details
Evaluation Metrics
Results and Analysis
...and 6 more sections

Figures (3)

Figure 1: Illustration of corpus composition in different domains. (a) represents domain-specific corpora; (b) represents domain-independent corpora. The overlap of Domain A (blue) and Domain B (orange) represents domain-agnostic data, while the non-overlapping regions represent domain-specific data.
Figure 2: Schematic diagram of AMD$^2$G framework. The target domain is E-Commerce and the domains used for de-domaining are Film, Music, Travel, and Medical. $\$P$ represents the placeholder. The method supports both encoder-decoder and decoder-only structures.
Figure 3: The first five panels show how average performance and PPL change with the amount of training data in each of the five domains. The last panel shows the n-gram similarity scores (Uni, Bi, Tri, and Quad) together with the average performance gain (DeltaScore) of AMD$^2$G-based models over direct training on the target domain corpus. To highlight the trend, the DeltaScore values are multiplied by 1000.
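
The n-gram similarity scores mentioned in the Figure 3 caption can be illustrated with a short Python sketch. The exact similarity definition used in the paper is not given in this section, so the Jaccard overlap of n-gram sets below is an assumption, and the function names and toy corpora are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in one tokenized utterance."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(corpus_a, corpus_b, n):
    """Jaccard overlap between the n-gram sets of two tokenized corpora.

    Assumption: Jaccard overlap is used here only as a simple stand-in
    for the paper's domain similarity score.
    """
    set_a = set().union(*(ngrams(utt, n) for utt in corpus_a))
    set_b = set().union(*(ngrams(utt, n) for utt in corpus_b))
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical toy corpora: each corpus is a list of tokenized utterances.
domain_a = [["a", "b", "c"]]
domain_b = [["a", "b", "d"]]

print(ngram_jaccard(domain_a, domain_b, 1))  # 0.5
print(ngram_jaccard(domain_a, domain_b, 2))  # 0.3333333333333333
```

Computing this score for n = 1 to 4 between a target domain and the remaining domains gives four values comparable in spirit to the Uni, Bi, Tri, and Quad curves.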
