Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning

Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Tarek Abdelzaher, Jiawei Han

TL;DR

FewGen addresses data scarcity in few-shot NLU by first training a label-conditioned generator on the limited data and then synthesizing large volumes of label-discriminative training samples. A meta-weighted generator objective automatically learns token-level weights to emphasize discriminative cues, while a noise-robust classifier is fine-tuned on both real and synthetic data with label smoothing and temporal ensembling. Across seven GLUE tasks, FewGen outperforms non-augmentation baselines by over 5 points on average and exceeds augmentation methods by more than 3 points, with ablations validating the importance of the meta-weighting and robust fine-tuning components. The approach demonstrates that carefully guided synthetic data can substantially narrow the gap between few-shot and fully supervised performance, offering a practical augmentation strategy for NLP under data constraints.
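
The two classifier-side regularizers named above can be illustrated with a minimal Python sketch. It assumes the standard form of label smoothing and an exponential-moving-average form of temporal ensembling with bias correction. The names `smooth_labels`, `cross_entropy`, and `TemporalEnsemble`, the `epsilon` and `momentum` defaults, and the toy predictions are illustrative assumptions, not details taken from the paper.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Smoothed target: (1 - epsilon) on the true class plus epsilon / K on every class."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == label else uniform
            for k in range(num_classes)]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


class TemporalEnsemble:
    """Running average of one sample's predictions across training steps.

    Assumed form: exponential moving average with bias correction, so that
    early estimates are not pulled toward the all-zero initial state.
    """

    def __init__(self, num_classes, momentum=0.9):
        self.momentum = momentum
        self.running = [0.0] * num_classes
        self.steps = 0

    def update(self, probs):
        """Fold in the latest prediction and return the corrected ensemble."""
        self.steps += 1
        m = self.momentum
        self.running = [m * r + (1.0 - m) * p
                        for r, p in zip(self.running, probs)]
        correction = 1.0 - m ** self.steps
        return [r / correction for r in self.running]


if __name__ == "__main__":
    # Toy 3-class prediction for a (possibly noisy) synthetic sample.
    target = smooth_labels(label=1, num_classes=3, epsilon=0.1)
    print("smoothed target:", target)
    print("loss:", cross_entropy(target, [0.2, 0.7, 0.1]))

    ensemble = TemporalEnsemble(num_classes=3)
    for step_probs in ([0.2, 0.7, 0.1], [0.3, 0.6, 0.1]):
        print("ensembled prediction:", ensemble.update(step_probs))
```

Under these assumptions, smoothing keeps the classifier from fitting a possibly wrong synthetic label with full confidence, and the ensembled prediction changes more slowly than any single step's output, which is what makes it usable as a stable regularization target.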

Abstract

Recent studies have revealed the intriguing few-shot learning ability of pretrained language models (PLMs): They can quickly adapt to a new task when fine-tuned on a small amount of labeled data formulated as prompts, without requiring abundant task-specific annotations. Despite their promising performance, most existing few-shot approaches that only learn from the small training set still underperform fully supervised training by nontrivial margins. In this work, we study few-shot learning with PLMs from a different perspective: We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large number of novel training samples that augment the original training set. To encourage the generator to produce label-discriminative samples, we train it via weighted maximum likelihood where the weight of each token is automatically adjusted based on a discriminative meta-learning objective. A classification PLM can then be fine-tuned on both the few-shot and the synthetic samples with regularization for better generalization and stability. Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods, improving over no-augmentation methods by 5+ average points, and outperforming augmentation methods by 3+ average points.
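
The weighted maximum likelihood objective mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-token log-probabilities and weights are already available as plain lists. In the paper the weights are learned through the meta-learning objective, whereas here they are simply passed in. The function name `weighted_nll` and the toy numbers are hypothetical.

```python
import math


def weighted_nll(token_log_probs, token_weights):
    """Token-weighted negative log-likelihood of one generated sequence.

    token_log_probs[j] is log p(x_j | x_<j, label) under the generator;
    token_weights[j] scales how much token j contributes to the loss.
    """
    if len(token_log_probs) != len(token_weights):
        raise ValueError("need exactly one weight per token")
    return -sum(w * lp for w, lp in zip(token_weights, token_log_probs))


if __name__ == "__main__":
    # Toy sequence of two tokens with probabilities 0.5 and 0.25.
    log_probs = [math.log(0.5), math.log(0.25)]

    # Uniform weights recover the ordinary maximum likelihood loss.
    print(weighted_nll(log_probs, [1.0, 1.0]))

    # Up-weighting the second token makes the loss focus on it,
    # as one would want if that token were the label-indicative one.
    print(weighted_nll(log_probs, [0.5, 1.5]))
```

With uniform weights the function reduces to the standard sequence-level negative log-likelihood, so the weighting can be read as a reallocation of training signal across tokens rather than a different loss family.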

Paper Structure

This paper contains 48 sections, 15 equations, 4 figures, 15 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview of FewGen. A generator PLM is first tuned on the few-shot samples with our proposed meta-weighted training objective and then used to synthesize new training samples. A classification PLM is finally trained on both the few-shot and the generated samples.
  • Figure 2: (On MNLI) Training the generator via $\mathcal{L}_{\text{gen}}$ does not automatically decrease $\mathcal{L}_{\text{disc}}$.
  • Figure 3: With different generator tuning objectives, (a) $\mathcal{L}_{\text{disc}}$ and (b) language modeling loss on the dev set.
  • Figure 4: Visualization of learned token weights on two samples from MNLI's few-shot training set. The generator is trained given the first sentence to generate the second. The tokens associated with higher weights are more label indicative.