NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

Ethan Smith

TL;DR

This work introduces NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers to improve training efficiency. It identifies one caveat: in ImageNet classification, Mixup/CutMix and other stochastic augmentations interfere with NOBLE's benefits.

Abstract

We introduce NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural augmentation that adds nonlinear low-rank branches to transformer linear layers. Unlike LoRA and other parameter-efficient fine-tuning (PEFT) methods, NOBLE is designed for pretraining from scratch: the branch is a permanent part of the architecture rather than an adapter fine-tuned on top of frozen weights. The branch computes $\sigma(xW_\text{down})W_\text{up}$, where $\sigma$ is a learnable nonlinearity. We evaluate several activation functions and find that CosNet performs best. CosNet is a two-layer cosine nonlinearity with learnable frequency and phase, with a linear projection between the two layers in the bottleneck space. NOBLE achieves substantial improvements with minimal overhead: up to 1.47$\times$ step speedup to reach baseline eval loss (up to 32% fewer training steps), with as low as 4% additional parameters and 7% step time overhead, resulting in up to 1.22$\times$ net wallclock speedup. Experiments on LLMs (250M and 1.5B parameters), BERT, VQGAN, and ViT consistently show improved training efficiency. We identify one caveat: in ImageNet classification, Mixup/CutMix and other stochastic augmentations interfere with NOBLE's benefits; when these augmentations are disabled, ViT also improves. A possible explanation for this discrepancy is that such regularization techniques encourage smoother fits to the target function, while NOBLE may specialize in its sharper aspects.
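The branch described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the paper's implementation. Assumptions that do not come from the text above: the branch output is summed with the main linear output, frequencies start at 1 and phases at 0, the random initialization scales are arbitrary, and $W_\text{up}$ is set to exactly zero as a simplification of near-zero initialization. All function and parameter names are hypothetical.

```python
import math
import random

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def cos_layer(h, freq, phase):
    """Elementwise cosine with a per-channel frequency and phase."""
    return [math.cos(f * v + p) for v, f, p in zip(h, freq, phase)]

def rand_matrix(rows, cols, scale, rng):
    """Gaussian random matrix; the scale is an illustrative choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(d_in, d_out, rank, seed=0):
    rng = random.Random(seed)
    return {
        "w": rand_matrix(d_in, d_out, d_in ** -0.5, rng),       # main weight
        "w_down": rand_matrix(d_in, rank, d_in ** -0.5, rng),   # d_in -> rank
        "w_mix": rand_matrix(rank, rank, rank ** -0.5, rng),    # rank -> rank
        "w_up": [[0.0] * d_out for _ in range(rank)],           # assumed: exact zero
        "freq1": [1.0] * rank, "phase1": [0.0] * rank,          # assumed start values
        "freq2": [1.0] * rank, "phase2": [0.0] * rank,
    }

def noble_linear(x, p):
    """Main linear path plus a CosNet-style low-rank branch (assumed to be summed)."""
    main = matmul(x, p["w"])
    h = matmul(x, p["w_down"])                    # project into the bottleneck
    h = cos_layer(h, p["freq1"], p["phase1"])     # first cosine layer
    h = matmul(h, p["w_mix"])                     # mixing projection in the bottleneck
    h = cos_layer(h, p["freq2"], p["phase2"])     # second cosine layer
    branch = matmul(h, p["w_up"])                 # project back to d_out
    return [m + b for m, b in zip(main, branch)]

params = init_params(d_in=8, d_out=8, rank=2)
x = [0.1 * i for i in range(8)]
y = noble_linear(x, params)
```

With $W_\text{up}$ at zero, the layer initially reproduces the plain linear output, so in this sketch the branch only begins to contribute once training moves $W_\text{up}$ away from zero.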

Paper Structure (45 sections, 4 equations, 5 figures, 5 tables)

Introduction
Related Work
Low-Rank Adaptation and PEFT.
Nonlinear Extensions of LoRA.
Nonlinear Projections in Attention.
Mixture of Experts.
Periodic Activations.
Method
NOBLE Architecture
CosNet: Recommended Nonlinearity
Key Design Choices
Near-Zero Initialization of $W_\text{up}$.
Reduced Main Weight Initialization.
Learning Rate Scaling.
CosNet Initialization.
...and 30 more sections

Figures (5)

Figure 1: Eval loss curves for 1.5B LLM pretraining on OpenWebText. NOBLE (solid) reaches baseline's best loss (2.51) in 143--154k steps vs 196k (1.34--1.37$\times$ faster). First 20% of training truncated.
Figure 2: Comparison of low-rank adaptation architectures. (a) NOBLE with CosNet: a sandwich of cosine activations with learnable frequency/phase, connected by a mixing projection, before projecting back to full dimensions. Key differences: it is trained from scratch as part of the architecture (not fine-tuned), and the cosine activation is symmetric and non-saturating. (b) MosLoRA adds a linear projection between the low-rank matrices ($BMA$): a similar structure, but without nonlinear activations. (c--d) Recent nonlinear variants (AuroRA, NoLoRA) introduce activations in the LoRA bottleneck for fine-tuning. (e) Standard LoRA adds a linear low-rank bypass $BA$.
Figure 3: Eval loss curves for autoregressive language modeling. Left: LLM Base 250M. Right: LLM Large 1.5B. NOBLE configurations (solid) consistently achieve lower loss than baseline (dashed) throughout training. Higher ranks provide greater improvement. First 20% of training truncated for clarity.
Figure 4: ViT-S training loss and validation accuracy on ImageNet-1k (second half of training shown). Left: Training loss. Without Mixup/CutMix, NOBLE achieves 5% lower training loss; with Mixup/CutMix, it is not clear that NOBLE provides any benefit. Right: Validation accuracy. Mixup/CutMix significantly boosts accuracy (67% $\rightarrow$ 74%), while NOBLE does not provide much improvement in either condition.
Figure 5: Activation function comparison. Eval loss curves for different bottleneck activations (rank 64, 150k steps). CosNet (2-layer cosine, red) achieves the lowest loss, followed by single cosine and 3-layer CosNet. ReLU-like activations (GELU, LeakyReLU) provide moderate improvement, while tanh shows minimal benefit. First 20% of training truncated for clarity.
