An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification

Zhuowei Chen, Lianxi Wang, Yuben Wu, Xinfeng Liao, Yujia Tian, Junyang Zhong

TL;DR

This work tackles sentiment classification under low-resource conditions where domain shift and label imbalance hinder performance. It introduces DiffusionCLS, a diffusion LM–based data augmentation framework with Label-Aware Noise Schedule and Label-Aware Prompting to reconstruct strong label-related tokens, balancing diversity and label-consistency. A noise-resistant training objective, incorporating contrastive learning alongside cross-entropy, mitigates noise from pseudo samples. Experiments across domain-specific and multilingual datasets show consistent improvements over strong baselines, with ablations and visualizations clarifying the diversity-consistency trade-off and practical deployment considerations.

Abstract

Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or use a language model to rephrase less important tokens in the original sequence. In SC, however, strongly emotional tokens can be decisive for the sentiment of the whole sequence. Therefore, rather than rephrasing less important context, we propose DiffusionCLS, which leverages a diffusion LM to capture in-domain knowledge and generates pseudo samples by reconstructing strong label-related tokens. This approach balances consistency and diversity, avoiding the introduction of noise while augmenting crucial features of the dataset. DiffusionCLS also includes a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios, including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of the framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
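
To make the idea of a training objective that combines cross-entropy with contrastive learning more concrete, the following is a minimal sketch in plain Python. It is not the paper's Noise-Resistant Training objective: the supervised-contrastive form, the weight `alpha`, the `temperature`, and all function names are assumptions chosen for illustration only.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive term (an assumed stand-in).

    Pulls same-label embeddings together and pushes different-label
    embeddings apart. Anchors without a same-label partner are skipped.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        sims = {j: cosine(embeddings[i], embeddings[j]) / temperature
                for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        log_probs = [sims[j] - log_denom for j in positives]
        losses.append(-sum(log_probs) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def combined_loss(logits_batch, embeddings, labels, alpha=0.5, temperature=0.1):
    """Mean cross-entropy plus a weighted contrastive term (alpha is hypothetical)."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + alpha * supervised_contrastive(embeddings, labels, temperature)
```

In this sketch the contrastive term is small when embeddings with the same label sit close together and large when labels are mixed across clusters, which is the behavior a classifier trained on noisy pseudo samples would be encouraged toward.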

Paper Structure (24 sections, 6 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Overview of the proposed method. DiffusionCLS comprises four core components: Label-Aware Noise Schedule, Label-Aware Prompting, Conditional Sample Generation, and Noise-Resistant Training.
  • Figure 2: The probability of a token remaining unmasked, with $\lambda$ set to 0.5.
  • Figure 3: Label-Aware Prompting. Each masked sequence is concatenated with its corresponding label.
  • Figure 4: Noise-resistant contrastive learning. Crosses denote generated samples and round dots denote original samples. The train-with-noise objective aims to enlarge the gap between original samples with different labels.
  • Figure 5: Performance of SC models on the SenWave dataset under the partial-data setting. Red lines denote the raw PLM results and blue lines represent models trained with DiffusionCLS.
  • ...and 2 more figures
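
The captions for Figures 2 and 3 describe masking tokens with a label-dependent probability and then concatenating each masked sequence with its label. A minimal sketch of that flow is given below. The keep-probability rule, the per-token `relevance` scores, the `noise_level` parameter, and the `[MASK]` and `[SEP]` strings are all assumptions made for illustration; the paper's actual schedule (including the role of $\lambda$) is not reproduced here.

```python
import random

MASK = "[MASK]"  # assumed mask token string

def label_aware_mask(tokens, relevance, noise_level, rng):
    """Mask tokens with a probability that grows with label relevance.

    relevance: hypothetical per-token scores in [0, 1]; higher means the
    token is more strongly tied to the label.
    noise_level: value in [0, 1] standing in for the diffusion step.
    """
    masked = []
    for token, score in zip(tokens, relevance):
        # Placeholder rule, not the paper's schedule.
        keep_prob = 1.0 - noise_level * score
        masked.append(token if rng.random() < keep_prob else MASK)
    return masked

def build_prompt(label, masked_tokens, sep="[SEP]"):
    """Concatenate the label with the masked sequence."""
    return " ".join([label, sep, *masked_tokens])

tokens = ["the", "food", "was", "awful"]
relevance = [0.0, 0.0, 0.0, 1.0]  # only "awful" is treated as label-related
masked = label_aware_mask(tokens, relevance, noise_level=1.0, rng=random.Random(0))
prompt = build_prompt("negative", masked)
# prompt == "negative [SEP] the food was [MASK]"
```

With the extreme scores used here the outcome is deterministic: tokens with relevance 0 are always kept and the token with relevance 1 is always masked, so the model would be asked to reconstruct the sentiment-bearing word given the label.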