Masked Mixers for Language Generation and Retrieval

Benjamin L. Badger

TL;DR

The work investigates the limitations of attention-based transformers in language tasks and proposes masked mixers, which replace self-attention with triangular-masked convolutions to improve global input representation accuracy (equivalently, invertibility). It introduces a representation-invertibility test, a masked-mixer architecture, and retrieval pipelines, showing that masked mixers preserve input information far better than transformers and yield superior embeddings for retrieval, even with far less pretraining data. The results demonstrate that masked mixers learn causal language modeling efficiently on small contexts ($n_{ctx}<512$) while delivering strong retrieval performance via InfoNCE cosine-similarity training and autoencoder pretraining, though transformers may still edge them out on larger-context tasks depending on optimization and hardware. Overall, the paper suggests that invertibility-driven design and input-representation fidelity can drive practical gains in retrieval and certain generation tasks, challenging the necessity of attention for efficient language modeling and motivating further exploration of feedforward architectures and pretraining strategies.
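As a rough illustration of the triangular masking mentioned above, the sketch below mixes information across sequence positions with a weight matrix whose upper triangle is zeroed, so that position i only reads positions j <= i. It is a minimal pure-Python sketch, not the paper's implementation: the function names `causal_mask` and `mix_tokens` are hypothetical, and treating the masked convolution as a single n_ctx by n_ctx weight matrix applied over the sequence dimension is an assumption made here for clarity.

```python
def causal_mask(weights):
    """Zero every weight that would let output position i read a later position j > i."""
    n = len(weights)
    return [[weights[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def mix_tokens(weights, x):
    """Mix token embeddings across sequence positions with lower-triangular weights.

    weights: n_ctx x n_ctx matrix (rows = output position, columns = input position).
    x:       n_ctx x d_model list of token embeddings.
    This is an illustrative stand-in for a masked token-mixing layer (assumption).
    """
    masked = causal_mask(weights)
    n, d = len(x), len(x[0])
    return [
        [sum(masked[i][j] * x[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


# With all-ones weights, the masked mixing reduces to a running sum over positions.
ones = [[1.0] * 4 for _ in range(4)]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mixed = mix_tokens(ones, tokens)
```

Because the mask removes every weight above the diagonal, changing a later token leaves the outputs at earlier positions unchanged, which is the property causal language modeling requires.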

Abstract

Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit that there is a downside to the use of attention: most input information is lost. In support of this idea we observe poor input representation accuracy in transformers and more accurate representation in what we term masked mixers, which replace self-attention with masked convolutions. The masked mixer learns causal language modeling more efficiently than early transformer implementations and even outperforms optimized, current transformers when training on small ($n_{ctx}<512$) but not larger context windows. Evidence is presented for the hypothesis that differences in transformer and masked mixer training efficiencies for various tasks are best predicted by input representation accuracy, or equivalently global invertibility. We hypothesize that the information loss exhibited by transformers would be more detrimental to retrieval than generation, as the former is more closely approximated by a bijective and thus invertible function. We find that masked mixers are more effective retrieval models both when the pretrained embedding model is unchanged and when the embedding model is modified via cosine similarity-based InfoNCE loss minimization. A small masked mixer is shown to outperform a large and near state-of-the-art transformer-based retrieval model, despite the latter being trained with many orders of magnitude more data and compute.
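For readers unfamiliar with the cosine similarity-based InfoNCE objective named in the abstract, the following is a minimal sketch of the standard form of that loss for a single query: a softmax cross-entropy over temperature-scaled cosine similarities between the query embedding and a set of candidate embeddings, one of which is the positive match. The function names, the `temperature` value, and the toy embeddings are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(query, candidates, positive_index, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive candidate's similarity.

    temperature is a hypothetical default chosen for this sketch.
    """
    logits = [cosine_similarity(query, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[positive_index]


# Toy example: the query aligns with candidate 0 and is orthogonal to candidate 1.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = info_nce_loss(query, candidates, positive_index=0)
loss_wrong = info_nce_loss(query, candidates, positive_index=1)
```

The loss is small when the query is most similar to its positive candidate and grows when a negative candidate is closer, so minimizing it pulls matched pairs together in embedding space relative to the other candidates.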

Paper Structure (38 sections, 11 equations, 24 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Our Contribution
Accurate Self and Non-Self Token Representation in Masked Mixers
Input representation background
Masked mixer architecture
Masked mixers but not transformers exhibit accurate input representations
Masked Mixer and Transformer Causal Language Modeling Efficiency
Architecture optimizations
Autoregressive inference for masked mixers
Masked mixers are more efficient causal language modelers for small (<512 tokens) but not large-context training
Masked mixers are more efficient language learners than early transformer implementations
Alignment Between Model and Task Invertibility Predicts Training Efficiency
Neither dataset nor task stochasticity explain mixer versus transformer performance differences
Multi-token and many-token prediction efficiencies
...and 23 more sections

Figures (24)

Figure 1: Indirect input representation method applied to a Llama-style transformer.
Figure 2: Causal language modeling via masking convolutional weights.
Figure 3: Masked mixers exhibit more accurate input representation than transformers. All models are $n_l=8$ and all transformers are llama architectures with $n_h=32$. In d) the transformer is $d_m=256$ and mixer $d_m=512$, trained on TinyStories. Inputs are random samples from TinyStories.
Figure 4: FineWeb Causal Language Model Training.
Figure 5: GPT-1 and masked mixer training efficiencies for $n_{ctx}=512$.
...and 19 more figures
