Unsupervised Text Embedding Space Generation Using Generative Adversarial Networks for Text Synthesis

Jun-Min Lee, Tae-Bin Ha

TL;DR

TESGAN introduces a novel unsupervised framework for text synthesis that generates continuous text embedding seeds instead of discrete tokens, enabling gradient-based learning and reducing memorization. A seed interpretation model maps seeds into sentences, while two seed-focused discriminators enforce structural realism, complemented by SDP and SFP auxiliary losses to align distributions and seed frames. Evaluations on DailyDialog and IMDb show favorable Fréchet BERT Distance and data memorization metrics, competitive diversity, and strong human judgments, demonstrating the potential of continuous embedding spaces for robust text generation. The approach points toward synergistic futures with Large Language Models by treating text as a learnable continuous space rather than purely discrete tokens.

Abstract

Generative Adversarial Networks (GANs) are models for data synthesis that create plausible data through competition between a generator and a discriminator. Although the application of GANs to image synthesis has been studied extensively, GANs have inherent limitations in natural language generation. Because natural language is composed of discrete tokens, gradients cannot easily be backpropagated to the generator; therefore, most text-GAN studies generate sentences starting from a random token and train with a reward system. As a result, the generators in previous studies are pre-trained autoregressively before adversarial training, which causes data memorization: the synthesized sentences reproduce the training data. In this paper, we synthesize sentences using a framework similar to the original GAN. More specifically, we propose Text Embedding Space Generative Adversarial Networks (TESGAN), which generate continuous text embedding spaces instead of discrete tokens to solve the gradient backpropagation problem. Furthermore, TESGAN conducts unsupervised learning that does not directly refer to the text of the training data, which overcomes the data memorization issue. By adopting this novel method, TESGAN can synthesize new sentences, showing the potential of unsupervised learning for text synthesis. We expect to see extended research combining Large Language Models with a new perspective of viewing text as a continuous space.
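
The gradient problem described in the abstract can be illustrated with a toy sketch. This is not the TESGAN model: the embedding table, the input vector, and all function names are hypothetical, and a finite difference stands in for backpropagation. It only shows that a discrete token choice is piecewise constant in its inputs (zero gradient almost everywhere), while a continuous embedding output changes smoothly.

```python
# Toy illustration only; every name and value here is hypothetical.

def argmax_token(logits):
    """Discrete output: index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def embed_discrete(logits, table):
    """Pick one token, then look up its embedding (piecewise constant in logits)."""
    return table[argmax_token(logits)]

def embed_continuous(weights, table):
    """Continuous output: a weighted mix of embedding rows (smooth in weights)."""
    dim = len(table[0])
    return [sum(w * row[d] for w, row in zip(weights, table)) for d in range(dim)]

def finite_diff(fn, x, i, eps=1e-4):
    """Central-difference estimate of d fn(x)[0] / d x[i]."""
    hi = list(x)
    hi[i] += eps
    lo = list(x)
    lo[i] -= eps
    return (fn(hi)[0] - fn(lo)[0]) / (2 * eps)

table = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8]]  # 3 tokens, 2-dim embeddings
x = [0.1, 2.0, -0.5]                            # generator output (assumed)

# Nudging x[0] does not change which token wins, so the output does not move.
grad_discrete = finite_diff(lambda v: embed_discrete(v, table), x, 0)
# The continuous mix responds to the same nudge in proportion to table[0][0].
grad_continuous = finite_diff(lambda v: embed_continuous(v, table), x, 0)
```

In this toy setting a discriminator's feedback would give the discrete path nothing to update, which is the motivation the abstract gives for generating in a continuous embedding space.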

Paper Structure (40 sections, 15 equations, 10 figures, 8 tables, 1 algorithm)

Figures (10)

  • Figure 1: Illustration of the seed interpretation model. The seed interpretation model is pre-trained with multi-turn sentences before adversarial training (left). After pre-training, the model's parameters are frozen, and the model is used to synthesize text from a seed (right). The $[PAD]$ tokens following the $[SEP]$ tokens are omitted in the left part for clarity.
  • Figure 2: Illustration of the text synthesis method using the seed interpretation model in the inference phase.
  • Figure 3: Illustration of the generator. P-TESGAN creates perturbed seeds by adding zero-centered Gaussian noise $z$ (gray) to the generator's output (blue).
  • Figure 4: Illustrations of the two discriminators. SSD predicts whether the seed is real or fake using the $[CLS]$ special token's feature. SOD considers both forward and backward contexts of the seed.
  • Figure 5: Illustration of Seed Distribution Prediction (SDP). SDP is used to enhance the fake seeds of the generator during adversarial training by minimizing the distance between real and fake seed distributions.
  • ...and 5 more figures
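
The seed perturbation described in the Figure 3 caption can be sketched as follows. This is a minimal sketch under stated assumptions: the seed is treated as a positions-by-dimensions list of floats, and the function name `perturb_seed`, the shape, and the noise scale `sigma` are all hypothetical, since the caption specifies only that the noise is zero-centered and normally distributed.

```python
import random

def perturb_seed(seed, sigma=0.1, rng=None):
    """Return seed + z, where each entry of z is drawn from N(0, sigma^2).

    `sigma` is an assumed hyperparameter; the caption does not give a value.
    """
    rng = rng or random.Random()
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in seed]

# Hypothetical generator output: 4 positions x 3 embedding dimensions.
fake_seed = [[0.0] * 3 for _ in range(4)]
perturbed = perturb_seed(fake_seed, sigma=0.1, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sketch reproducible; setting `sigma=0.0` returns the seed values unchanged.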