Not Enough Data? Deep Learning to the Rescue!

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, Naama Zwerdling

TL;DR

Scarce labeled data hampers text classification performance. The authors introduce LAMBADA, a data augmentation framework that fine-tunes GPT-2 on small labeled sets to synthesize labeled sentences and then filters them with a baseline classifier to ensure quality. Across ATIS, TREC, and WVA, LAMBADA yields statistically significant gains over baselines and competing generative methods, especially at very small data sizes, and can outperform unlabeled-data approaches. The work demonstrates the practical value of leveraging pre-trained language models for targeted, label-conditioned data generation combined with conservative filtering to improve discriminative classifiers.

Abstract

Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning. We mainly focus on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), involves fine-tuning a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Using the fine-tuned model and given a class label, new sentences for the class are generated. Our process then filters these new sentences by using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the state-of-the-art techniques for data augmentation, specifically those applicable to text classification tasks with little data.
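The three steps described in the abstract (train on the small labeled set, generate sentences for a given class label, filter with a classifier trained on the original data) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The fine-tuned language generator is replaced by a toy word-recombining generator and the baseline classifier by a toy word-count scorer, so the example runs without any model download. The names `fit_classifier`, `make_generator`, `lambada_augment`, `per_class`, and `oversample` are hypothetical, and the filtering rule (keep candidates whose predicted label matches the requested label, ranked by classifier confidence) is one plausible reading of the filtering step.

```python
import random
from collections import Counter, defaultdict


def fit_classifier(data):
    """Toy stand-in for the baseline classifier: per-class word counts."""
    counts = defaultdict(Counter)
    for text, label in data:
        counts[label].update(text.lower().split())

    def predict(text):
        words = text.lower().split()
        scores = {}
        for label, c in counts.items():
            total = sum(c.values())
            # Add-one smoothed relative frequency of each word in this class.
            scores[label] = sum((c[w] + 1) / (total + 1) for w in words)
        norm = sum(scores.values()) or 1.0
        best = max(scores, key=scores.get)
        return best, scores[best] / norm  # (predicted label, confidence)

    return predict


def make_generator(data, seed=0):
    """Toy stand-in for the fine-tuned generator (assumption: NOT GPT-2)."""
    rng = random.Random(seed)
    vocab = sorted({w for text, _ in data for w in text.lower().split()})
    by_label = defaultdict(list)
    for text, label in data:
        by_label[label].append(text.lower().split())

    def generate(label):
        # Start from a sentence of the requested class and randomly swap in
        # words from the whole vocabulary, so some outputs are off-class.
        base = rng.choice(by_label[label])
        return " ".join(w if rng.random() < 0.7 else rng.choice(vocab) for w in base)

    return generate


def lambada_augment(data, per_class=3, oversample=10, seed=0):
    """Generate candidates per label, then keep the most confident matches."""
    predict = fit_classifier(data)          # classifier trained on original data
    generate = make_generator(data, seed)   # label-conditioned generator
    synthesized = []
    for label in sorted({lab for _, lab in data}):
        candidates = {generate(label) for _ in range(per_class * oversample)}
        scored = []
        for text in candidates:
            predicted, confidence = predict(text)
            if predicted == label:          # drop sentences the classifier disputes
                scored.append((confidence, text))
        scored.sort(reverse=True)
        synthesized += [(text, label) for _, text in scored[:per_class]]
    return synthesized


train = [
    ("show me flights from boston to denver", "flight"),
    ("list flights to dallas tomorrow", "flight"),
    ("how much is the fare to denver", "airfare"),
    ("what is the cheapest fare to boston", "airfare"),
]
augmented = lambada_augment(train)
for text, label in augmented:
    print(label, "|", text)
```

Generating more candidates than needed (`oversample`) and keeping only the top-ranked ones is what makes the filter conservative: the final classifier would then be retrained on `train + augmented`.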

Paper Structure

This paper contains 30 sections, 3 equations, 1 figure, 6 tables, 1 algorithm.

Figures (1)

  • Figure 1: Accuracy for each sample size over ATIS