σ-GPTs: A New Approach to Autoregressive Models

Arnaud Pannatier; Evann Courdier; François Fleuret

TL;DR

σ-GPT introduces shuffled autoregression: a Transformer is trained on randomly permuted sequences and augmented with dual positional encodings, which gives on-demand control over the generation order. The approach supports conditional density estimation, infilling, and token-based rejection sampling, which generates sequences in bursts with a dynamically varying number of steps. Across language modeling, maze path solving, and aircraft vertical-rate prediction, σ-GPT is competitive with left-to-right GPT and has advantages over diffusion baselines in several settings, at the cost of higher training complexity. The work shows that order-agnostic generation has practical benefits for conditioning and fast sampling, particularly for tasks that need flexible inference.

Abstract

Autoregressive models, such as the GPT family, use a fixed order, usually left-to-right, to generate sequences. However, this is not a necessity. In this paper, we challenge this assumption and show that by simply adding a positional encoding for the output, this order can be modulated on the fly for each sample, which offers key advantages. It allows sampling of, and conditioning on, arbitrary subsets of tokens, and it also allows several tokens to be sampled at once, dynamically, according to a rejection strategy, leading to a sub-linear number of model evaluations. We evaluate our method across various domains, including language modeling, path-solving, and aircraft vertical rate prediction, decreasing the number of steps required for generation by an order of magnitude.

Paper Structure (32 sections, 8 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Contributions
Methodology
σ-GPTs: Shuffled Autoregression
Double Positional Encodings
Conditional Probabilities and Infilling
Token-based Rejection Sampling
Other Orders
Denoising Diffusion Models
Results
General performance
Training Efficiency
Curriculum Learning
Open Text Generation: t-SNE of Generated Sequences
Training and Generating in Fractal Order
...and 17 more sections

Figures (7)

Figure 1: In our $\sigma$-GPT, an arbitrary shuffling order $\sigma$ can be chosen on-the-fly for every sample. It induces an input order $0,\sigma(1), \sigma(2), \dots$ and an output order $\sigma(1),\sigma(2), \sigma(3), \dots$, where the input is first padded with a $0$ to ensure a consistent number of tokens. Tokens are shuffled accordingly, and these orders are both encoded separately with two positional encodings concatenated to the input, allowing the model to sample consistently in the autoregressive process. The output is finally shuffled back to the true order.
Figure 2: (Left.) We can infill the sequence by conditioning on the known part (black points). (Right.) We can also have estimates of the density at any point of the sequence.
Figure 3: Conditional density estimation and infilling on the maze path-solving task.
Figure 4: 2D t-SNE of text-small-3-embeddings of 3000 sequences generated by each method. We compute the t-SNE of all the embeddings together, and then we display in each graph the embeddings of the validation set (green), the embeddings of the corresponding method (blue), and the embeddings of the other methods (gray). We see that the embeddings of the generated sequences have the same overall distribution as those of the validation set, which seems to indicate that GPT, $\sigma$-GPT, $\sigma$-GPT with burst-sampling, and diffusion models can generate sequences of similar quality.
Figure 5: Number of examples needed to switch from memorization to generalization. The model is trained on a restricted dataset size in the path-finding task. We see that the model trained in a random order needs more examples to switch from memorization to generalization. At 1k samples both models are fully in a memorization regime and at 100k both generalize, but in between, at 10k, the model trained in a random order is still in a memorization regime.
...and 2 more figures
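
The shuffling described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `PAD_TOKEN` placeholder, and the convention of 1-indexed positions with 0 reserved for the padded slot are assumptions made for illustration. It only shows how a permutation σ induces the input order 0, σ(1), σ(2), ... and the output order σ(1), σ(2), σ(3), ..., and how generated tokens are shuffled back to the true order.

```python
import random

# Hypothetical placeholder: the caption only says the input is padded with a 0.
PAD_TOKEN = "<pad>"


def random_order(n, rng=random):
    """Draw a random permutation sigma of the positions 1..n."""
    sigma = list(range(1, n + 1))
    rng.shuffle(sigma)
    return sigma


def shuffle_example(tokens, sigma):
    """Build shuffled inputs and targets for one sequence.

    tokens: list of n tokens; the token at true position p is tokens[p - 1].
    sigma:  permutation of 1..n giving the order in which tokens are produced.
    """
    n = len(tokens)
    assert sorted(sigma) == list(range(1, n + 1)), "sigma must permute 1..n"
    shuffled = [tokens[p - 1] for p in sigma]
    input_tokens = [PAD_TOKEN] + shuffled[:-1]  # pad, then all but the last token
    input_pos = [0] + sigma[:-1]                # 0, sigma(1), sigma(2), ...
    output_pos = list(sigma)                    # sigma(1), sigma(2), sigma(3), ...
    targets = shuffled                          # token expected at each output position
    return input_tokens, input_pos, output_pos, targets


def unshuffle(generated, sigma):
    """Place tokens produced in order sigma back at their true positions."""
    restored = [None] * len(sigma)
    for token, p in zip(generated, sigma):
        restored[p - 1] = token
    return restored


tokens = ["a", "b", "c", "d"]
sigma = [3, 1, 4, 2]
input_tokens, input_pos, output_pos, targets = shuffle_example(tokens, sigma)
restored = unshuffle(targets, sigma)
```

In this sketch, step i of the model would see the token at `input_pos[i]` together with both `input_pos[i]` and `output_pos[i]`, and would be asked to predict `targets[i]`. How the two positional encodings are embedded and concatenated is left out, since the caption does not specify it.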
