Learning Randomized Algorithms with Transformers

Johannes von Oswald; Seijin Kobayashi; Yassir Akram; Angelika Steger

TL;DR

The paper investigates how transformers can learn randomized algorithms when seed-based randomness is injected at the input and training uses adversarial objectives. A q-norm based objective combined with multiple seeds lets transformers learn robust, randomized strategies that outperform deterministic baselines under adversarial inputs, and majority voting across seeds amplifies the success probability. Across associative recall, graph coloring, and grid-world exploration, learned randomness emerges and yields better worst-case performance, illustrating a principled bridge between randomized algorithms and neural networks. The approach points toward robustness against adversaries and may apply more broadly, albeit with notable computational cost and scalability concerns.
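To make the role of $q$ concrete, here is a minimal sketch. It assumes the objective aggregates the seed-averaged loss of each input with a normalized $q$-norm over inputs, so that $q=1$ recovers the ordinary average and large $q$ approaches the worst case. The function name `q_norm_loss` and the two toy loss tables are illustrative assumptions, not taken from the paper.

```python
def q_norm_loss(losses_per_input, q):
    """Normalized q-norm over inputs of the seed-averaged loss.

    losses_per_input[i][j] is the loss on input i under seed j (assumed in [0, 1]).
    """
    seed_avg = [sum(row) / len(row) for row in losses_per_input]
    return (sum(l ** q for l in seed_avg) / len(seed_avg)) ** (1.0 / q)

# Two inputs, two seeds (hypothetical numbers).
# Deterministic model: ignores the seed, always solves input 0, never input 1.
deterministic = [[0.0, 0.0], [1.0, 1.0]]
# Randomized model: each seed solves exactly one of the two inputs.
randomized = [[0.0, 1.0], [1.0, 0.0]]

for q in (1, 4, 32):
    print(q, q_norm_loss(deterministic, q), q_norm_loss(randomized, q))
```

Under these assumptions the two models tie at $q=1$, while for larger $q$ the deterministic model's loss approaches 1 and the randomized model's loss stays at 0.5.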

Abstract

Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks' computation and predictions.
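The amplification by repetition and majority voting mentioned above can be sketched numerically. The example assumes a binary answer, independent seeds, a per-run success probability `p` above 1/2, and an odd number of runs `k`. The helper name `majority_success` is hypothetical.

```python
from math import comb

def majority_success(p: float, k: int) -> float:
    """Probability that a strict majority of k independent runs is correct.

    Assumes each run succeeds independently with probability p and k is odd.
    """
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )

for k in (1, 3, 11, 101):
    print(k, majority_success(0.6, k))
```

With `p = 0.6`, a single run succeeds with probability 0.6, three runs with probability 0.648, and the value keeps growing with `k`.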

Paper Structure (25 sections, 3 theorems, 10 equations, 10 figures, 3 tables)

Introduction
Theoretical Considerations
Excessive model capacity will not enforce randomness
Randomization is not beneficial in expectation
Randomization can be beneficial in adversarial settings.
Our training objective.
Experimental Results
Evaluation protocol and baselines
Randomized transformers solve associative recall
Randomized transformers can solve graph coloring problems
Randomized transformer agents explore grid worlds
Discussion & Related work
Acknowledgment
Appendix
Proof of Proposition 1
...and 10 more sections

Key Result

Proposition 1

[Randomization can be beneficial in worst-case scenarios] Assume that $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$ for some $d$, and that $L$ is continuous. Furthermore, assume that there exist a parameter $\theta^*$ and a set of random seeds $(r_i)_{1\leq i \leq N} \in \mathcal{R}^N$ such that f… Then there exists a randomized model with strictly smaller loss $\mathcal{L}^A$ than any determinis…
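As a toy illustration of the kind of gap the proposition describes (it is not the construction used in the paper), consider a memory that can hold only `M` of `N` key-value pairs and an oblivious adversary who knows the strategy but not the seed. The sizes and the helper `worst_case_success` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

N, M = 5, 2  # hypothetical sizes: N key-value pairs, memory for M of them

def worst_case_success(strategy):
    """strategy: list of (probability, stored_key_set) pairs.

    Returns the recall probability on the query that an oblivious
    adversary, who knows the strategy but not the seed, would pick.
    """
    return min(
        sum(p for p, stored in strategy if key in stored)
        for key in range(N)
    )

# Deterministic: always store the first M keys.
deterministic = [(Fraction(1), frozenset(range(M)))]
# Randomized: store a uniformly random subset of M keys.
subsets = [frozenset(c) for c in combinations(range(N), M)]
randomized = [(Fraction(1, len(subsets)), s) for s in subsets]

print(worst_case_success(deterministic))  # 0
print(worst_case_success(randomized))     # 2/5
```

The deterministic rule is broken by querying any key it never stores, whereas the uniform random rule recalls every key with probability `M/N`.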

Figures (10)

Figure 1: Solving associative recall tasks - the randomized way. A): A causal transformer model, with linear self-attention layers, is trained to remember $N$ value vectors, each associated with a unique key. Given finite memory of size $M<N$, the model needs to decide which data to memorize in order to return the correct value when queried with some $k_i$. An algorithm that deterministically chooses what to remember will perform poorly, as simple adversarial strategies will break retrieval. On the other hand, randomly deciding what to store protects the algorithm against such worst-case scenarios. To have a transformer learn such a randomized strategy, we train it with an adversarial objective and, diverging from usual machine learning setups, provide randomness, denoted as a seed, as an additional input to the model. B): Histogram showing the fraction of possible inputs, all of length $N=20$, with varying recall success rates. Left: We train two different models: 1) a deterministic one where we fix the input seed and 2) a randomized one with varying input seed. While the deterministic model consistently fails on some fraction of the inputs, the randomized model stores varying parts of the sequence in memory by leveraging the provided seed, thus succeeding with non-zero probability on all possible recall tasks. Right: At inference time, majority voting can be applied to the randomized model evaluated on several seeds, thus amplifying the success rate. With enough seeds, recall becomes optimal. C): Recall success rate for worst-case sequences of each model when trained on various input lengths. For larger sequence lengths, the deterministic model fails to recall on adversarial inputs and the worst-case success rate thus drops to zero. On the other hand, randomized models have a non-zero success rate on all inputs. Strikingly, when amplifying the success probability with majority voting, the randomized model succeeds in recalling virtually all inputs.
Figure 2: Left: Variance of the predictive probability conditioned on the input sequence, w.r.t. increasing $q$: Larger $q$ leads to randomized transformer models with non-zero output variance over the seeds, with a phase transition around $q \approx 16$. Right: Histogram showing the fraction of inputs, all of length $N=20$, with varying recall success rates, when training $A^q_{R}$ with various $q$. For ERM training, i.e. $q=1$, the transformer produces essentially binary predictions, i.e. for each input the predictions over seeds $r$ are either all correct or all incorrect. For $q> 1$, we see randomization emerging, with non-zero success rate on all inputs for $q=32$.
Figure 3: Associative recall analyses. A): Performance of models trained with ERM does not improve when provided with additional seeds, cf. overlapping lines of $A^q_{r_0}$ and $A^q_{R}$. B): Models trained with our objective ($q=100$) do exhibit drastic improvements, especially with majority voting, compared to models trained with a single fixed seed in their input. C): Influence of training with different $q$ and $m$, measured on the loss with $q=1$. First, models show gradual improvement when increasing $q$, with randomness emerging above a certain threshold. Second, when fixing $q=100$ (rightmost plot), we already observe improvements over the deterministic counterpart for small $m > 1$.
Figure 4: Transformers solving graph coloring problems. A): We study distributed vertex coloring problems on cycles $C_n$ where every vertex can only communicate with its immediate neighbors. B): A locally masked transformer, which can only attend to the immediate neighboring vertices on the graph through an appropriate attention mask, has to decide on the vertex color. C): If the transformer realizes a fixed mapping of vertex id $v_i$ to a color, an adversary can provide an input permutation for which the transformer fails to generate a correct coloring. However, when trained with our objective, the transformer model implements a randomized strategy to protect itself against this adversary. The coloring computed by the transformer for a fixed graph now depends on the random seed: it fails for some seeds and is correct for others, ideally with a large probability of being correct.
Figure 5: Graph coloring performance analyses. A): Performance of models trained on $\mathcal{L}^q$ for varying values of $q$. As $q$ increases, $A^q_{R}$ learns to leverage randomness to implement a randomized algorithm, inducing large improvements in worst-case performance compared to the deterministic counterpart $A^q_{r_0}$. Furthermore, majority voting boosts performance close to optimal (green). Both the average and percentile curves are computed over all possible cycles of size $N=10$. B): Influence of varying $q$ during training, evaluated on the loss with $q=1$. After training with $q> 3$, a clear advantage of the randomized model is observed. C): Histogram, over all possible inputs, of the transformer's success probability when training $A^q_R$ with varying $q$. When increasing $q$, we observe a non-zero success rate due to randomization.
...and 5 more figures
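The adversarial argument in the graph coloring caption can be sketched as follows, assuming a hypothetical fixed rule `id mod c` with `c = 3` colors on a cycle of `n = 5` vertices. The seed-dependent baseline, where every vertex draws an independent uniform color, is a naive stand-in for illustration and not the strategy learned by the transformer.

```python
from itertools import product

def is_proper(colors):
    """True if no two neighbours on the cycle share a color."""
    n = len(colors)
    return all(colors[i] != colors[(i + 1) % n] for i in range(n))

def adversarial_order(id_to_color):
    """Place two ids with the same color next to each other, if such a pair exists."""
    ids = list(id_to_color)
    for a in ids:
        for b in ids:
            if a != b and id_to_color[a] == id_to_color[b]:
                rest = [v for v in ids if v not in (a, b)]
                return [a, b] + rest
    return None

n, c = 5, 3  # hypothetical cycle length and number of colors
fixed = {v: v % c for v in range(n)}  # hypothetical deterministic rule
order = adversarial_order(fixed)
print(order, is_proper([fixed[v] for v in order]))

# Naive seed-dependent baseline: independent uniform colors per vertex.
# Its success probability is the same for every ordering of the ids.
proper = sum(is_proper(cs) for cs in product(range(c), repeat=n))
print(proper, c ** n)
```

Whenever there are more vertex ids than colors, two ids must share a color, so the adversary can always break a fixed id-to-color rule. The random baseline succeeds on every input with the same non-zero probability, here 30 out of 243 colorings.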

Theorems & Definitions (8)

Definition 1: Oblivious adversary
Definition 2: Min-max loss
Proposition 1
Proposition 1: Randomization can be beneficial in worst-case scenarios
proof
Definition 3
Definition 4
Proposition 2
