Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability

Tao Tao, Maissam Barkeshli

TL;DR

This work probes whether Transformer models can learn the hidden recurrence and permutation structures of PCGs, a nontrivial class of PRNGs, by evaluating in-context prediction across multiple PCG variants and moduli up to $m=2^{22}$. The authors show that Transformers achieve strong in-context learning, even when outputs are truncated, and identify scaling laws, such as the required context growing approximately as $\tfrac{1}{2}\sqrt{m}$, with curriculum learning and pretrained initialization crucial for large moduli. They also demonstrate that curriculum strategies, data mixing, and transfer from smaller-modulus models enable stable scaling under fixed compute budgets, and reveal interpretable, rotation-invariant token embeddings alongside generator-separation dynamics in intermediate layers. Together, the results provide insights into how transformers encode modular arithmetic tasks, with implications for cryptography-inspired benchmarks, curriculum design, and model interpretability in structured arithmetic settings.
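As a back-of-the-envelope check of the $\tfrac{1}{2}\sqrt{m}$ rule (a worked calculation added for illustration, not a number quoted from the paper), the smallest and largest moduli that appear in this summary give

$$\tfrac{1}{2}\sqrt{2^{16}} = 2^{7} = 128, \qquad \tfrac{1}{2}\sqrt{2^{22}} = 2^{10} = 1024.$$

The first value is consistent with the 128 in-context elements mentioned in the Figure 2(d) caption, assuming that panel uses $m=2^{16}$.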

Abstract

We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find that even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
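The hidden-state permutation described above can be made concrete with a small sketch. The Python below follows the XSLRR-16/8 steps listed in the Figure 1 caption (top 3 bits as control, right shift by 8, XOR with the state, keep the low 8 bits, rotate right by the control value) on top of a standard LCG update. The multiplier, increment, seed, function names, and the choice to emit an output before advancing the state are illustrative assumptions, not details taken from the paper.

```python
M = 1 << 16          # modulus m = 2^16, as in the Figure 1 caption
A_EXAMPLE = 25173    # hypothetical multiplier (a ≡ 1 mod 4), not from the paper
C_EXAMPLE = 13849    # hypothetical odd increment, not from the paper


def lcg_step(state, a=A_EXAMPLE, c=C_EXAMPLE, m=M):
    """Advance the hidden state with the LCG recurrence s' = (a*s + c) mod m."""
    return (a * state + c) % m


def rotr8(x, r):
    """Rotate an 8-bit value right by r bit positions."""
    r %= 8
    return ((x >> r) | (x << (8 - r))) & 0xFF


def xslrr_16_8(state):
    """Map a 16-bit state to an 8-bit output, following the caption's steps."""
    control = state >> 13                   # top 3 bits select the rotation
    mixed = (state ^ (state >> 8)) & 0xFF   # shift by 8, XOR, keep low 8 bits
    return rotr8(mixed, control)


def generate(seed, n, a=A_EXAMPLE, c=C_EXAMPLE, m=M):
    """Return n outputs; emitting before stepping is an assumed ordering."""
    state = seed % m
    outputs = []
    for _ in range(n):
        outputs.append(xslrr_16_8(state))
        state = lcg_step(state, a, c, m)
    return outputs


if __name__ == "__main__":
    print(generate(seed=12345, n=8))
```

A model in the in-context setting sees only the 8-bit values returned by `generate`, never the 16-bit state, `a`, or `c`.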

Paper Structure

This paper contains 46 sections, 13 equations, 27 figures, 1 table.

Figures (27)

  • Figure 1: Depiction of PCG protocols at $m=2^{16}$ with 8-bit output. Left: XSLRR-16/8. (a) State $s_i$. The top 3 bits are control bits. (b) $s_i$ is right-shifted by 8 bits. (c) The shifted state is XORed with $s_i$. (d) The lower 8 bits are retained and rotated right by the value of the control bits to produce the output. Middle: XSHRR-16/8. (e) State $s_i$, with the top 3 bits as control; the lowest few bits are unused. (f) $s_i$ is right-shifted by 5 bits. (g) The shifted state is XORed with $s_i$. (h) The upper 8 bits immediately following the control bits are retained and rotated right by the control bits to produce the output. Right: XSHRS-16/8. (i) State $s_i$, with the top 2 bits as control bits. (j) $s_i$ is right-shifted by 3 bits. (k) The shifted state is XORed with $s_i$. (l) Starting from after the control bits, the output window is right-shifted by the control bits, producing the output.
  • Figure 2: (a) Test accuracy at the 512th token during training on combined datasets of diverse PRNG variants. (b) Accuracy during training on XSLRR-16/8 dataset. “512th” refers to the model’s prediction accuracy at the 512th token. “Avg” denotes accuracy averaged across all token positions. (c) Final test accuracy by position index for combined training. (d) Final test accuracy when trained separately on each generator type, where all variants achieve near 100% accuracy with only 128 in-context elements.
  • Figure 3: Left: Prediction accuracy at the 64th, 128th, and 256th sequence positions as a function of bits kept ($k$) in truncated LCGs with $m=2^{16}$. Accuracy improves with larger $k$ and longer context, remaining far above the random baseline $1/2^k$ even under severe truncation. Middle: For XSLRR, accuracy improves stepwise as more context is observed, with reliable predictions emerging once the context length reaches about $0.5\sqrt{m}$ elements. Right: Context length required to exceed 90% test accuracy scales as $\tfrac{1}{2}\sqrt{m}$ with modulus $m$.
  • Figure 4: Scaling studies of dataset size and model capacity. Left: Prediction accuracy at positions 64, 128, and 256 as a function of dataset size ($n_a \times n_c$ sequences). Accuracy improves rapidly with larger datasets and saturates once sufficient diversity is reached. Middle and Right: Test accuracy heatmaps across model depth ($n_\text{layers}$) and number of heads ($n_\text{heads}$), evaluated at positions 128 and 256. Larger models achieve higher accuracy, with nearly perfect prediction at position 128 once $n_\text{layers} \geq 4$ and $n_\text{heads} \geq 8$.
  • Figure 5: Effect of mixing smaller-modulus data on training stability and final accuracy. (a) Test loss on $m{=}2^{18}$ under three training setups: training only on $m{=}2^{18}$ (blue), fixed mixing with $\alpha{=}0.2$ (orange), and curriculum mixing starting at $\alpha{=}0.2$ and decaying to $0$ over 40k steps (green). (b,c) Learning-rate and weight-decay landscapes for 256th-token accuracy on $m{=}2^{18}$, comparing training solely on $m{=}2^{18}$ (b) versus with the curriculum (c). (d) Test accuracy at the 256th token for both $m{=}2^{16}$ and $m{=}2^{18}$ under fixed mixing and curriculum mixing as the initial $\alpha$ varies.
  • ...and 22 more figures
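
The curriculum mixing described in the Figure 5 caption (a mixing weight starting at $\alpha=0.2$ and decaying to $0$ over 40k steps) can be sketched as a simple schedule. The caption does not state the shape of the decay or how $\alpha$ is applied, so the sketch assumes a linear decay and treats $\alpha$ as the probability of drawing a training sequence from the smaller modulus; the function names and the per-sequence sampling are hypothetical.

```python
import random


def mixing_alpha(step, alpha0=0.2, decay_steps=40_000):
    """Mixing weight at a training step (linear decay is an assumption)."""
    if step >= decay_steps:
        return 0.0
    return alpha0 * (1.0 - max(step, 0) / decay_steps)


def sample_modulus(step, rng, small_m=1 << 16, large_m=1 << 18):
    """Pick the modulus for the next training sequence.

    With probability alpha the smaller modulus is used (assumed reading of
    alpha); otherwise the target modulus is used.
    """
    return small_m if rng.random() < mixing_alpha(step) else large_m


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 20_000, 40_000):
        draws = [sample_modulus(step, rng) for _ in range(10_000)]
        share_small = draws.count(1 << 16) / len(draws)
        print(step, round(mixing_alpha(step), 3), round(share_small, 3))
```

Fixed mixing, the other setup named in the caption, corresponds to holding the weight constant instead of calling `mixing_alpha` with the current step.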