VOLTA: Improving Generative Diversity by Variational Mutual Information Maximizing Autoencoder

Yueen Ma, Dafeng Chi, Jingjing Li, Kai Song, Yuzheng Zhuang, Irwin King

TL;DR

VOLTA tackles the lack of diversity in Transformer-based NLG by inserting a variational latent space into the decoder through a novel cross-attention-based connection, and by augmenting it with InfoGAN-style latent codes for input-independent variability. The model supports both continuous and discrete latent variables and can operate with decoder-only or encoder-decoder Transformers, using a unified objective that combines autoencoding loss, latent-variable regularization, and a VMIM term to maximize mutual information between latent codes and generated outputs. Empirical results across six datasets and three NLG tasks show that VOLTA substantially improves generative diversity while preserving or enhancing quality, outperforming established VAE-Transformer baselines and demonstrating stable optimization. The work provides a scalable, architecture-agnostic approach to controllable, diverse generation with practical applicability to QA, language modeling, and dialog systems.
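
The three-part objective described above can be written schematically as follows. This is only a sketch: the weights \beta and \lambda, the auxiliary distribution Q, and the exact grouping of terms are illustrative assumptions, not the paper's notation. The second line is the standard InfoGAN variational lower bound on the mutual information between a latent code c and a generated output y, which is the quantity a VMIM term maximizes.

```latex
% Schematic only: \beta, \lambda, and Q are assumed symbols, not taken from the paper.
\mathcal{L}
  = \underbrace{\mathcal{L}_{\mathrm{AE}}}_{\text{autoencoding loss}}
  + \beta \, \underbrace{\mathcal{L}_{\mathrm{REG}}}_{\text{latent-variable regularization}}
  - \lambda \, \underbrace{\mathcal{L}_{\mathrm{VMIM}}}_{\text{mutual-information bound}}

% InfoGAN-style variational lower bound (z: VAE latent variable, c: latent code, y: output):
I(c;\, y) \;\ge\;
  \mathbb{E}_{c \sim p(c),\; y \sim p_\theta(y \mid z, c)}
  \bigl[ \log Q(c \mid y) \bigr] + H(c)
  \;=\; \mathcal{L}_{\mathrm{VMIM}}
```

Because the bound is maximized while the other two terms are minimized, it enters the combined loss with a negative sign. In a standard VAE with continuous latents, the regularizer would be the KL divergence between the approximate posterior and the prior; the form used for discrete latents is not stated in this summary.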

Abstract

The natural language generation domain has witnessed great success thanks to Transformer models. Although they have achieved state-of-the-art generative quality, they often neglect generative diversity. Prior attempts to tackle this issue suffer from either low model capacity or over-complicated architectures. Some recent methods employ the VAE framework to enhance diversity, but their latent variables fully depend on the input context, restricting exploration of the latent space. In this paper, we introduce VOLTA, a framework that elevates generative diversity by bridging Transformer with VAE via a more effective cross-attention-based connection, departing from conventional embedding concatenation or summation. Additionally, we propose integrating InfoGAN-style latent codes to enable input-independent variability, further diversifying the generation. Moreover, our framework accommodates discrete inputs alongside its existing support for continuous inputs. We perform comprehensive experiments with two types of Transformers on six datasets from three different NLG tasks to show that our approach can significantly improve generative diversity while maintaining generative quality.
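
The cross-attention-based connection can be illustrated with a minimal pure-Python sketch. It assumes a single attention head with no learned projections, one memory vector per latent, and a uniform prior for the input-independent code. The function names (`sample_vae_latent`, `sample_latent_code`, `cross_attend`) and all numeric values are hypothetical and are not taken from the paper or its code.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_vae_latent(mu, logvar, rng):
    """Reparameterized Gaussian sample z = mu + sigma * eps (context-dependent)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def sample_latent_code(dim, rng):
    """Input-independent code; the uniform prior on [-1, 1] is an assumption."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def cross_attend(queries, memory):
    """Single-head scaled dot-product attention without learned projections.

    queries: decoder hidden states, one vector per target position.
    memory:  latent vectors used as both keys and values.
    """
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, memory)) for i in range(dim)])
    return outputs


rng = random.Random(0)
dim = 4
# Hypothetical encoder outputs for one context: posterior mean and log-variance.
mu, logvar = [0.1, -0.2, 0.3, 0.0], [-1.0, -1.0, -1.0, -1.0]
decoder_states = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]

# Two draws for the same context yield two different attention memories.
memory_a = [sample_vae_latent(mu, logvar, rng), sample_latent_code(dim, rng)]
memory_b = [sample_vae_latent(mu, logvar, rng), sample_latent_code(dim, rng)]
out_a = cross_attend(decoder_states, memory_a)
out_b = cross_attend(decoder_states, memory_b)
print(len(out_a), len(out_a[0]), out_a != out_b)
```

The sketch shows where the variability comes from: the decoder states are identical in both calls, yet the attended outputs differ because the latent memory is resampled. A real model would add learned query, key, and value projections and multiple heads; concatenation or summation of embeddings, the alternatives the abstract mentions, would instead mix the latent into the input embeddings directly.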

Paper Structure

This paper contains 35 sections, 27 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: The overview of VOLTA. The encoder encodes the context into VAE latent variables. The variables, augmented with InfoGAN-style latent codes, can be continuous or discrete based on the input type. Subsequently, they are connected to the decoder through the cross-attention mechanism. Leveraging the variability inherent in the latent space, the decoder generates diverse content conditioned on the context.
  • Figure 2: t-SNE visualization comparing question embeddings from GPT-2 with latent-variable embeddings produced by VOLTA. Points of the same color are embeddings from the same context. VOLTA produces diverse embeddings for each context, in contrast to the deterministic embeddings of a vanilla LM.