Avey-B

Devang Acharya, Mohammad Hammoud

TL;DR

This paper reformulates Avey for the encoder-only paradigm and proposes several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression.

Abstract

Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.

Paper Structure (33 sections, 2 theorems, 9 equations, 5 figures, 15 tables)

Key Result

Proposition A.1

For a fixed target row $i$ and any two embeddings $j_1,j_2\in\{1,\ldots,C\}$: Consequently, a more relevant token (higher similarity) receives at least as large (and typically larger) weight than a less relevant token, and increasing its relevance cannot reduce or flip the sign of its contribution in the update within the dynamic layer (i.e., the update is monotone with resp

Figures (5)

  • Figure 1: A simple illustration of coupled (a) and decoupled (b) parameterizations ($e_i =$ embedding $i$; $s_{ij} =$ cosine similarity score between $e_i$ and $e_j$; $n_i =$ neuron $i$, ${n_i}^{(d)} =$ neuron $i$ in dynamic layer $d$; ${n_i}^{(s)} =$ neuron $i$ in static layer $s$; and $w_{ij} =$ weight corresponding to $e_i$ or ${n_i}^{(d)}$ used in the weighted sum of $n_j$ or ${n_j}^{(s)}$, respectively).
  • Figure 2: Throughput of Avey-B, ModernBERT, and NeoBERT on NVIDIA B200 GPUs with mixed precision (BF16). We use Avey-B base, ModernBERT base, and NeoBERT medium (the only publicly available size). Avey-B is shown in (a) as optimized using torch.compile (no fused-kernel implementation is available yet) and in (b) as unoptimized (eager). For ModernBERT and NeoBERT, throughput is shown for system-optimized (with FlashAttention) and system-unoptimized (eager) variants in (a) and (b), respectively.
  • Figure 3: The throughput of Avey-B with and without the neural compressor.
  • Figure 4: Latency of Avey-B, ModernBERT, and NeoBERT on NVIDIA B200 GPUs with mixed precision (BF16). We use Avey-B base, ModernBERT base, and NeoBERT medium (the only publicly available size). Avey-B is shown in (a) as optimized using torch.compile (no fused-kernel implementation is available yet) and in (b) as unoptimized (eager). For ModernBERT and NeoBERT, latency is shown for system-optimized (with FlashAttention) and system-unoptimized (eager) variants in (a) and (b), respectively.
  • Figure 5: Learned static cross-embedding projection matrices for the (a) coupled configuration (left or red) with 15 matrices uniformly subsampled from 30 static layers and (b) decoupled configuration (right or blue) with all 15 static matrices (dynamic and static layers are interleaved, so only 15 static matrices exist). For comparability, we display 15 layers per panel. The coupled setting exhibits diffuse, more homogeneous patterns (e.g., see layers 14, 22, 24, and 26) suggestive of redundancy, whereas the decoupled setting shows sharper, more heterogeneous structure and variability in spread, indicating greater representational diversity.

Theorems & Definitions (4)

  • Proposition A.1: dynamic layer monotonicity
  • proof
  • Proposition A.2: static layer non-violation
  • proof
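
The monotonicity property named in Proposition A.1 can be illustrated with a small sketch. It computes cosine similarity scores $s_{ij}$ for a fixed target row $i$, turns them into weights, and checks that a higher score never yields a smaller weight. This summary does not state the paper's actual normalization, so softmax is used here only as an assumed, order-preserving stand-in. The toy embeddings and all function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Order-preserving normalization (assumed stand-in, not the paper's)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def row_weights(embeddings, i):
    """Scores and weights of target row i against all C embeddings."""
    scores = [cosine_similarity(embeddings[i], e) for e in embeddings]
    return scores, softmax(scores)

def is_monotone(scores, weights):
    """True if s[j1] >= s[j2] implies w[j1] >= w[j2] for every pair."""
    n = len(scores)
    return all(
        weights[j1] >= weights[j2]
        for j1 in range(n)
        for j2 in range(n)
        if scores[j1] >= scores[j2]
    )

# Hypothetical toy embeddings (C = 4, dimension 3).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
scores, weights = row_weights(embeddings, i=0)
monotone = is_monotone(scores, weights)
```

Any strictly increasing map from scores to weights would pass the same check. The sketch shows only the ordering part of the proposition, not the sign-preservation claim or the full dynamic-layer update.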