Latent Normalizing Flows for Discrete Sequences

Zachary M. Ziegler, Alexander M. Rush

TL;DR

This work tackles the challenge of applying normalizing flows to discrete sequences by embedding a highly multimodal flow-based prior inside a VAE and emitting discrete observations from a simple, inputless decoder. It introduces three flow architectures (AF/AF, AF/SCF, IAF/SCF) and an extension with Non-Linear Squared (NLSq) flows to capture the multimodal dynamics essential for discrete data. Experiments on character-level language modeling and polyphonic music modeling show that the autoregressive latent-flow model can match a comparable autoregressive baseline, while the non-autoregressive variant generates faster at some cost in performance. The results highlight the potential of continuous latent representations for modeling discrete sequences and point to future directions in conditional and GAN-integrated frameworks.

Abstract

Normalizing flows are a powerful class of generative models for continuous random variables, showing both strong model flexibility and the potential for non-autoregressive generation. These benefits are also desired when modeling discrete random variables such as text, but directly applying normalizing flows to discrete sequences poses significant additional challenges. We propose a VAE-based generative model which jointly learns a normalizing flow-based distribution in the latent space and a stochastic mapping to an observed discrete space. In this setting, we find that it is crucial for the flow-based distribution to be highly multimodal. To capture this property, we propose several normalizing flow architectures to maximize model flexibility. Experiments consider common discrete sequence tasks of character-level language modeling and polyphonic music generation. Our results indicate that an autoregressive flow-based model can match the performance of a comparable autoregressive baseline, and a non-autoregressive flow-based model can improve generation speed with a penalty to performance.

Paper Structure

This paper contains 29 sections, 23 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Flow diagrams for normalizing flows acting on sequences of scalars. Circles represent random variables $\epsilon_d$ or $z_d$. Diamonds represent a parameterized invertible scalar transformation, $f_\theta$, in this case an affine transformation. Diagrams show the sampling process ($\boldsymbol{\epsilon} \rightarrow \boldsymbol{z}$, read left to right) and density evaluation ($\boldsymbol{\epsilon} \leftarrow \boldsymbol{z}$, read right to left). While all models can be used in both directions, they differ in whether the calculation is serial or parallel: AF is parallel in evaluation but serial in sampling ($\leftarrow$) because $z_1$ is needed to sample $z_2$, whereas SCF is parallel for both ($\leftrightarrow$).
  • Figure 2: Proposed generative model of discrete sequences. The model first samples a sequence length $T$ and then a latent continuous sequence $\boldsymbol{z}_{1:T}$. Each $x_t$ is shown separately to highlight their conditional independence given $\boldsymbol{z}_{1:T}$. Normalizing flow specifics are abstracted by $p(\boldsymbol{z})$ and are described in Section \ref{sec:prior}.
  • Figure 3: Example conditional distributions $p(x_t|\boldsymbol{x}_{<t})$ from continuous (PixelCNN++, 10 mixture components, trained on CIFAR-10, top) and discrete (LSTM char-level LM trained on PTB, bottom) autoregressive models.
  • Figure 4: Normalizing flows acting on $T \times H$ random variables proposed in this work. Circles with variables represent random vectors of size $H$. Bold diamonds each represent a multilayer AF ($\leftarrow$) or a multilayer SCF ($\leftrightarrow$), as in Figure \ref{fig:standard_flows}d. Arrows to a bold diamond represent additional dependencies to all affine transformations within the indicated AF or SCF. As above, the arrows point to the parallel direction, i.e. (a) is parallel in density evaluation whereas (c) is parallel in sampling.
  • Figure 5: Non-Linear Squared (NLSq) flow for multimodal distribution modeling. (a, b) NLSq transformation defined by hand-selecting 4 layers of flow parameters: (a) composed transformation, (b) base density (red) and final density (blue). (c) Resulting density for a learned 2D transformation from a standard Gaussian to a Gaussian mixture distribution, via a 5-layer AF-like model using the NLSq flow.
  • ...and 2 more figures
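
The serial/parallel asymmetry described in the Figure 1 caption can be illustrated with a minimal sketch of a single-layer affine autoregressive flow (AF) over a sequence of scalars. This is not the paper's implementation: the `conditioner` function is a hypothetical stand-in for a learned network, and the names `af_sample`, `af_inverse`, and `af_log_density` are assumptions made for illustration only.

```python
import numpy as np

def conditioner(z_prev):
    """Hypothetical stand-in for a learned network.

    Maps the prefix z_{<d} to the affine parameters (shift, log-scale)
    used to transform position d.
    """
    s = np.sum(z_prev)  # sum of an empty prefix is 0.0
    return 0.5 * np.tanh(s), 0.1 * np.tanh(s)

def af_sample(eps):
    """Sampling direction (eps -> z): serial.

    z_d depends on z_{<d}, so positions must be generated one at a time.
    """
    z = np.zeros_like(eps)
    for d in range(len(eps)):
        shift, log_scale = conditioner(z[:d])
        z[d] = shift + np.exp(log_scale) * eps[d]
    return z

def af_inverse(z):
    """Evaluation direction (z -> eps): parallelizable.

    All of z is observed, so the parameters for every position can be
    computed independently of one another (written as a loop here only
    for brevity).
    """
    params = [conditioner(z[:d]) for d in range(len(z))]
    shift = np.array([p[0] for p in params])
    log_scale = np.array([p[1] for p in params])
    eps = (z - shift) * np.exp(-log_scale)
    log_det = -np.sum(log_scale)  # log |det d eps / d z|
    return eps, log_det

def af_log_density(z):
    """Log density of z under a standard Gaussian base distribution."""
    eps, log_det = af_inverse(z)
    log_base = -0.5 * np.sum(eps ** 2) - 0.5 * len(z) * np.log(2 * np.pi)
    return log_base + log_det

eps = np.random.default_rng(0).standard_normal(6)
z = af_sample(eps)
eps_rec, _ = af_inverse(z)
print(np.allclose(eps, eps_rec), af_log_density(z))
```

In `af_sample`, each step of the loop needs the previously generated values, which is why sampling is serial. In `af_inverse`, no step depends on the output of another, which is why density evaluation can run in parallel.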