Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Yuval Weiss; David Demitri Africa; Paula Buttery; Richard Diehl Martinez

TL;DR

This study evaluates ReLoRA, a rank-expanding adaptation strategy, in small language models (11M and 66M parameters) to determine whether its merge-and-restart updates improve learning under tight capacity. Using Dolma for pretraining, Paloma for perplexity, and BLiMP for linguistic evaluation, the authors analyze learning dynamics via proportional effective rank (PER) and condition numbers (CN) of both weights and updates. Across measurements, ReLoRA underperforms full-rank training and amplifies rank deficiencies, with early training updates showing strong ill-conditioning, especially in smaller models. The findings suggest that benefits of ReLoRA in large models do not trivially transfer to low-resource pretraining, motivating adaptive or hybrid-rank approaches for efficient yet expressive pretraining in small-scale transformers.
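The two diagnostics named above can be made concrete with a small NumPy sketch. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalised singular values), divided by the maximum possible rank to make it proportional, and takes the condition number as the ratio of the largest to the smallest singular value. The function names and the `eps` guard are illustrative assumptions, not the authors' code, and the paper's exact definitions may differ.

```python
import numpy as np


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank (assumed definition, see lead-in)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)          # normalise singular values to a distribution
    p = p[p > eps]                   # drop numerically-zero mass before taking logs
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def proportional_effective_rank(matrix):
    """Effective rank divided by the largest rank the matrix could have."""
    return effective_rank(matrix) / min(matrix.shape)


def condition_number(matrix, eps=1e-12):
    """Largest over smallest singular value; eps guards against division by zero."""
    s = np.linalg.svd(matrix, compute_uv=False)  # sorted in descending order
    return float(s[0] / max(s[-1], eps))
```

Under these definitions an identity matrix uses every dimension equally (PER close to 1, CN close to 1), while a rank-one matrix has a PER of roughly 1 over its dimension and a very large CN. The same functions could in principle be applied either to a weight matrix or to the difference between two checkpoints (an update).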

Abstract

Parameter-efficient methods like LoRA have revolutionised large language model (LLM) fine-tuning. ReLoRA extends this idea to pretraining by repeatedly merging and reinitialising low-rank adapters, increasing cumulative rank while keeping updates cheap. This aligns well with observations that high-capacity models learn through locally low-rank trajectories that expand over time. By contrast, recent work suggests that small language models (SLMs) exhibit rank deficiencies and under-utilise their available dimensionality. This raises a natural question: can ReLoRA's rank-expanding update rule steer SLMs toward healthier learning dynamics, mitigating rank bottlenecks in a capacity-constrained regime? We argue SLMs are an ideal testbed: they train quickly, enable controlled ablations, and make rank phenomena more measurable. We present the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Across loss, Paloma perplexity, and BLiMP, we find that ReLoRA underperforms full-rank training, with gaps widening at larger scales. Analysis of proportional effective rank and condition numbers shows that ReLoRA amplifies existing rank deficiencies and induces ill-conditioned updates early in training. Our results suggest that while ReLoRA's merge-and-restart strategy can expand ranks in larger models, it does not straightforwardly translate to capacity-limited SLMs, motivating adaptive-rank or hybrid-rank approaches for low-compute pretraining.
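To illustrate why merging and reinitialising low-rank adapters raises the cumulative rank, the toy sketch below accumulates several rank-r products into one weight matrix. It is a simplified illustration under stated assumptions, not the authors' training code: the function name `relora_toy` and its parameters are hypothetical, and the "training" of each adapter is stood in for by random factors, so only the rank arithmetic is meaningful.

```python
import numpy as np


def relora_toy(d=32, r=4, cycles=3, seed=0):
    """Toy merge-and-restart loop; returns the rank of the merged update per cycle."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((d, d))      # full-rank base weight
    merged_delta = np.zeros((d, d))      # sum of all merged adapter products
    ranks = []
    for _ in range(cycles):
        # Restart: fresh adapter factors of rank r.
        # ASSUMPTION: random factors stand in for a trained adapter.
        b = rng.standard_normal((d, r))
        a = rng.standard_normal((r, d))
        # Merge: fold the low-rank product into the base weight.
        update = b @ a
        w = w + update
        merged_delta = merged_delta + update
        ranks.append(int(np.linalg.matrix_rank(merged_delta, tol=1e-8)))
    return ranks
```

Each individual update has rank at most r, but the sum of k such updates can reach rank k times r (capped at d), which is the sense in which the cumulative rank increases while every step stays cheap. Whether trained adapters actually span new directions in each cycle is an empirical question, and it is the kind of behaviour the rank analysis in this study examines.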
