Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

Chunyuan Li, Xiang Gao, Yuan Li, Baolin Peng, Xiujun Li, Yizhe Zhang, Jianfeng Gao

TL;DR

Optimus introduces a large-scale pre-trained latent-variable language model that learns a universal sentence latent space via VAE objectives, enabling both guided generation and robust low-resource understanding. By grounding a BERT-like encoder and a GPT-2-like decoder in a shared latent space, Optimus achieves stronger representation learning, mitigates KL-vanishing through pre-training, and supports controllable generation via latent-space arithmetic and interpolation. The approach yields state-of-the-art results on VAE language modeling benchmarks, improves dialog and stylized text generation, and demonstrates notable benefits in few-shot or low-resource understanding settings. This work suggests that pre-training a meaningful latent space can make deep generative models more practical and versatile for NLP tasks in the modern pre-trained language modeling era.
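The latent-space interpolation and arithmetic mentioned above can be sketched on plain vectors. This is a minimal illustration, not the authors' code: the function names `interpolate` and `analogy` are hypothetical, and the sketch assumes sentences have already been encoded into latent vectors that a decoder (not shown) would turn back into text.

```python
import numpy as np


def interpolate(z_a, z_b, num_steps):
    """Return num_steps latent vectors on the straight line from z_a to z_b."""
    taus = np.linspace(0.0, 1.0, num_steps)
    # Each row is a convex combination (1 - tau) * z_a + tau * z_b.
    return np.stack([(1.0 - tau) * z_a + tau * z_b for tau in taus])


def analogy(z_a, z_b, z_c):
    """Latent arithmetic: apply the offset from z_a to z_b onto z_c."""
    return z_b - z_a + z_c
```

Decoding each row returned by `interpolate` would be expected to give a gradual transition between the two source sentences, provided the latent space is smooth. Decoding the output of `analogy` would transfer the difference between the first two sentences onto the third.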

Abstract

When trained effectively, the Variational Autoencoder (VAE) can be both a powerful generative model and an effective representation learning framework for natural language. In this paper, we propose the first large-scale language VAE model, Optimus. A universal latent embedding space for sentences is first pre-trained on a large text corpus and then fine-tuned for various language generation and understanding tasks. Compared with GPT-2, Optimus enables guided language generation from an abstract level using the latent vectors. Compared with BERT, Optimus can generalize better on low-resource language understanding tasks due to the smooth latent space structure. Extensive experimental results on a wide range of language tasks demonstrate the effectiveness of Optimus. It achieves a new state of the art on VAE language modeling benchmarks. We hope that our first large pre-trained VAE language model and its results can help the NLP community renew its interest in deep generative models in the era of large-scale pre-training, and make these principled methods more practical.
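For readers less familiar with the VAE training objective, its standard $\beta$-weighted form is given below. This is the generic textbook loss, not a formula quoted from this page. The symbols (encoder $q_{\phi}$, decoder $p_{\theta}$, prior $p(z)$, weight $\beta$) are notation assumed here for illustration.

```latex
\mathcal{L}_{\beta}(x)
  = -\,\mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
    + \beta\,\mathrm{KL}\big(q_{\phi}(z \mid x)\,\|\,p(z)\big)
```

The first term is the reconstruction loss and the second pulls the approximate posterior toward the prior. KL vanishing is the failure mode in which the second term collapses to zero, so the posterior matches the prior for every input and the decoder learns to ignore $z$.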

Paper Structure

This paper contains 51 sections, 14 equations, 6 figures, 20 tables.

Figures (6)

  • Figure 1: Illustration of Optimus architecture.
  • Figure 2: Illustration of two schemes to inject latent vector. (a) Memory: $x_t$ attends both $x_{<t}$ and ${\boldsymbol{h}}_{\texttt{Mem}}$; (b) Embedding: latent embedding is added into old embeddings to construct new token embedding ${\boldsymbol{h}}_{\texttt{Emb}}^{\prime}$.
  • Figure 3: Testing accuracy with a varying number of labeled training samples per class on the $\mathtt{Yelp}$ dataset.
  • Figure 4: Comparison of t-SNE visualizations of the learned features. The colors indicate different labels.
  • Figure 5: Comparison of three schemes for injecting the latent vector into GPT-2 for guided language generation on (a) Yelp and (b) PTB. The learning curves show the reconstruction error per word. Emb indicates that the latent vector is added to the other embeddings as an additional embedding, Mem indicates that it is used as an additional memory token for GPT-2 to attend to, and Mem+Emb combines the two schemes.
  • ...and 1 more figure
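The two injection schemes named in the Figure 2 caption can be sketched with NumPy arrays. This is a single-layer illustration under stated assumptions, not the authors' implementation: the projection matrices `W_emb` and `W_mem`, the function names, and the shapes are all hypothetical. The actual decoder may project the latent vector differently, for example once per layer.

```python
import numpy as np


def inject_embedding(token_emb, z, W_emb):
    """Embedding scheme: add a projected latent vector to every token embedding.

    token_emb: (T, H) token embeddings; z: (P,) latent vector; W_emb: (P, H).
    """
    return token_emb + z @ W_emb  # the (H,) projection broadcasts over T positions


def inject_memory(token_emb, z, W_mem):
    """Memory scheme: prepend a projected latent vector as an extra position."""
    h_mem = z @ W_mem  # (H,)
    return np.vstack([h_mem[None, :], token_emb])  # (T + 1, H)


def causal_mask_with_memory(num_tokens):
    """Boolean attention mask where index 0 is the memory slot.

    Row i marks the positions that position i may attend to. A lower-triangular
    mask keeps the usual causal constraint while leaving the memory column
    visible to every token position.
    """
    size = num_tokens + 1
    return np.tril(np.ones((size, size), dtype=bool))
```

In this sketch the memory scheme leaves the token embeddings unchanged and exposes the latent vector through attention, while the embedding scheme alters every input embedding directly. The two can be combined by applying `inject_embedding` first and `inject_memory` second.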