Transformer Neural Autoregressive Flows

Massimiliano Patacchiola; Aliaksandra Shysheya; Katja Hofmann; Richard E. Turner

TL;DR

Transformer Neural Autoregressive Flows (T-NAFs) address scalability and training instability in neural autoregressive density estimators by using a Transformer as an autoregressive conditioner that outputs the parameters of an invertible transformation, with densities computed via $p_Y(y)=p_X(x)\,|\det J_f(x)|^{-1}$. They treat each variable dimension as an input token and enforce autoregression with attention masking, achieving strong expressivity with shared parameters across dimensions. Across UCI benchmarks and BSDS300, T-NAFs match or outperform NAFs/B-NAFs while using a single flow and substantially fewer parameters, demonstrating improved parameter efficiency. The work highlights the potential of Transformer-based conditioning to scale neural autoregressive flows and suggests future optimizations of attention-based computation for density estimation.
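As a worked form of the change-of-variables identity above, the following derivation sketches why an autoregressive structure makes the density cheap to evaluate. It assumes the usual autoregressive-flow setup, in which the parameters $\boldsymbol{\psi}_i$ of the per-dimension transformation $t$ depend only on the preceding dimensions; the map direction (data $x$ to base variable $y = f(x)$) follows the identity in the TL;DR, and $D$ denotes the number of dimensions.

```latex
\begin{aligned}
y_i &= t\big(x_i;\ \boldsymbol{\psi}_i\big),
  \qquad \boldsymbol{\psi}_i = \boldsymbol{\psi}_i(x_1, \dots, x_{i-1}), \\
\log\left|\det J_f(x)\right|
  &= \sum_{i=1}^{D} \log\left|\frac{\partial y_i}{\partial x_i}\right|
  \qquad \text{(the Jacobian is lower triangular)}, \\
\log p_X(x)
  &= \log p_Y\big(f(x)\big)
   + \sum_{i=1}^{D} \log\left|\frac{\partial t(x_i; \boldsymbol{\psi}_i)}{\partial x_i}\right| .
\end{aligned}
```

The last line is the TL;DR identity rearranged for $p_X(x)$ and written in log form, so the log-likelihood reduces to a base log-density plus a sum of $D$ scalar derivative terms.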

Abstract

Density estimation, a central problem in machine learning, can be performed using Normalizing Flows (NFs). NFs comprise a sequence of invertible transformations that turn a complex target distribution into a simple one by exploiting the change of variables theorem. Neural Autoregressive Flows (NAFs) and Block Neural Autoregressive Flows (B-NAFs) are arguably the most performant members of the NF family. However, they suffer from scalability issues and training instability due to the constraints imposed on the network structure. In this paper, we propose a novel solution to these challenges by exploiting transformers to define a new class of neural flows called Transformer Neural Autoregressive Flows (T-NAFs). T-NAFs treat each dimension of a random variable as a separate input token, using attention masking to enforce an autoregressive constraint. We take an amortization-inspired approach where the transformer outputs the parameters of an invertible transformation. The experimental results demonstrate that T-NAFs consistently match or outperform NAFs and B-NAFs across multiple datasets from the UCI benchmark. Remarkably, T-NAFs achieve these results using an order of magnitude fewer parameters than previous approaches, without composing multiple flows.
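The idea of treating each dimension as a token and enforcing autoregression through an attention mask can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention with identity query/key/value projections, the shift-by-one input with a zero start token, and the names `causal_mask`, `masked_self_attention`, and `conditioner_hidden` are all assumptions made for illustration. The shift is one common way to make the output at position `i` depend only on dimensions strictly before `i`; the paper may achieve this differently.

```python
import numpy as np

def causal_mask(D):
    # mask[i, j] is True where token i may attend to token j (j <= i).
    return np.tril(np.ones((D, D), dtype=bool))

def masked_self_attention(tokens, mask):
    # Single-head attention with identity Q/K/V projections (an assumption,
    # chosen only to keep the sketch short).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def conditioner_hidden(x, w, pos):
    # Shift inputs right by one (hypothetical zero start token), so the token
    # at position i carries x[i-1]; with the causal mask, output i then
    # depends only on x[0], ..., x[i-1].
    shifted = np.concatenate([[0.0], x[:-1]])
    tokens = shifted[:, None] * w[None, :] + pos  # scalar embedding + position vector
    return masked_self_attention(tokens, causal_mask(len(x)))

rng = np.random.default_rng(0)
D, d = 5, 8
w = rng.normal(size=d)            # hypothetical linear embedding of a scalar
pos = rng.normal(size=(D, d))     # hypothetical learnable position vectors
x = rng.normal(size=D)

h = conditioner_hidden(x, w, pos)

# Perturb dimension 2: outputs 0..2 should stay fixed, later ones may change.
x_perturbed = x.copy()
x_perturbed[2] += 1.0
h_perturbed = conditioner_hidden(x_perturbed, w, pos)
```

Comparing `h` and `h_perturbed` row by row gives a quick check of the autoregressive constraint: rows 0 to 2 are unaffected by the change to `x[2]`, while later rows are free to differ.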

Paper Structure (13 sections, 13 equations, 2 figures, 2 tables)

Introduction
Previous work
Background
Normalizing Flows
Autoregressive Flows
Neural Autoregressive Flows
Transformer Neural Autoregressive Flows
Overview of the architecture
Transformation
Experiments
Density Estimation
Ablations
Conclusions

Figures (2)

Figure 1: Graphical representation of a T-NAF model. Left: the architecture includes a transformer neural network conditioner $\text{TN}$ and an invertible transformation $t$. In $\text{TN}(x_1, \dots, x_D, i; \boldsymbol{\theta})$ each dimension of the random variable is linearly projected to get an embedding, which is added to a learnable position vector. A sequence of $L$ transformer layers is used to produce hidden embeddings $\mathbf{h}_1, \dots, \mathbf{h}_D$ that are passed through a projection head to generate pseudo-parameters $\boldsymbol{\psi}_1, \dots, \boldsymbol{\psi}_D$. The pseudo-parameters are used as part of an invertible transformation $t(x_i; \boldsymbol{\psi}_i)$. Right: detailed schematic of a transformer encoder layer. Inputs are passed through a normalization layer, a Multi-Head Attention (MHA) layer with an autoregressive mask, another normalization layer, and an MLP.
Figure 2: Trade-off between number of parameters and performance. The vertical axis represents the number of parameters in log-scale (lower is better), and the horizontal axis represents the log-likelihood on the test set (higher is better). The optimal trade-off is represented by points in the bottom-right corner. Overall, T-NAF offers a better trade-off than B-NAF; the gap in number of parameters between T-NAF and B-NAF gets larger as the number of input dimensions $D$ increases.
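The last stage of the pipeline in the Figure 1 caption (hidden embeddings, projection head, pseudo-parameters, invertible transformation) can be sketched as follows. This is a simplified illustration rather than the paper's transformation: an elementwise affine map stands in for $t$, each $\boldsymbol{\psi}_i$ is assumed to hold a log-scale and a shift, the base distribution is assumed to be a standard normal, and `h` is a random placeholder for the embeddings a transformer conditioner would produce. All function names are hypothetical.

```python
import numpy as np

def project_to_pseudo_params(h, W, c):
    # Projection head: one row of pseudo-parameters psi_i per dimension.
    return h @ W + c                               # shape (D, 2)

def affine_t(x, psi):
    # Stand-in invertible transformation t(x_i; psi_i) = exp(a_i) * x_i + b_i.
    log_scale, shift = psi[:, 0], psi[:, 1]
    y = np.exp(log_scale) * x + shift
    log_abs_det = log_scale.sum()                  # sum_i log |dy_i / dx_i|
    return y, log_abs_det

def affine_t_inverse(y, psi):
    # Inverse of affine_t for fixed pseudo-parameters.
    log_scale, shift = psi[:, 0], psi[:, 1]
    return (y - shift) * np.exp(-log_scale)

def log_density(x, psi):
    # log p_X(x) = log p_Y(f(x)) + log |det J_f(x)|, with a standard normal base.
    y, log_abs_det = affine_t(x, psi)
    log_base = -0.5 * np.sum(y ** 2) - 0.5 * x.size * np.log(2.0 * np.pi)
    return log_base + log_abs_det

rng = np.random.default_rng(0)
D, d = 4, 8
h = rng.normal(size=(D, d))        # placeholder for transformer hidden embeddings
W = 0.1 * rng.normal(size=(d, 2))  # hypothetical projection-head weights
c = np.zeros(2)                    # hypothetical projection-head bias
x = rng.normal(size=D)

psi = project_to_pseudo_params(h, W, c)
y, log_abs_det = affine_t(x, psi)
x_recovered = affine_t_inverse(y, psi)
log_p = log_density(x, psi)
```

In a real autoregressive flow the pseudo-parameters depend on earlier dimensions of `x`, so inversion proceeds one dimension at a time; the elementwise inverse here applies only because `psi` is held fixed.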
