Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez

TL;DR

This study evaluates ReLoRA, a rank-expanding adaptation strategy, in small language models (11M and 66M parameters) to determine whether its merge-and-restart updates improve learning under tight capacity. Using Dolma for pretraining, Paloma for perplexity, and BLiMP for linguistic evaluation, the authors analyze learning dynamics via proportional effective rank (PER) and condition numbers (CN) of both weights and updates. Across measurements, ReLoRA underperforms full-rank training and amplifies rank deficiencies, with early training updates showing strong ill-conditioning, especially in smaller models. The findings suggest that benefits of ReLoRA in large models do not trivially transfer to low-resource pretraining, motivating adaptive or hybrid-rank approaches for efficient yet expressive pretraining in small-scale transformers.
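As a rough illustration of the two diagnostics named above, the sketch below computes a proportional effective rank and a condition number from a list of singular values. It assumes PER is the entropy-based effective rank (the exponential of the Shannon entropy of the normalised singular values) divided by the number of singular values; the paper's exact definition may differ. The function names are hypothetical, and in practice the singular values would come from an SVD routine such as numpy.linalg.svd.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    singular values after normalising them to sum to one."""
    total = sum(singular_values)
    if total == 0:
        # An all-zero matrix has no defined spectrum distribution.
        return float("nan")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def proportional_effective_rank(singular_values):
    """Effective rank divided by the number of singular values
    (assumed normalisation), giving a value in (0, 1]."""
    return effective_rank(singular_values) / len(singular_values)


def condition_number(singular_values):
    """Ratio of the largest to the smallest singular value;
    infinite when the smallest singular value is zero."""
    smallest = min(singular_values)
    if smallest == 0:
        return math.inf
    return max(singular_values) / smallest


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]      # all directions used equally
    peaked = [1.0, 0.0, 0.0, 0.0]    # a single dominant direction
    print(proportional_effective_rank(flat))
    print(proportional_effective_rank(peaked))
    print(condition_number([4.0, 2.0, 1.0]))
```

Under this definition, a flat spectrum gives a PER near 1 and a spectrum dominated by one direction gives a PER near 1/n, so lower values indicate that fewer of the available dimensions are in use. A large condition number indicates that the largest and smallest singular values differ widely.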

Abstract

Parameter-efficient methods like LoRA have revolutionised large language model (LLM) fine-tuning. ReLoRA extends this idea to pretraining by repeatedly merging and reinitialising low-rank adapters, increasing cumulative rank while keeping updates cheap. This aligns well with observations that high-capacity models learn through locally low-rank trajectories that expand over time. By contrast, recent work suggests that small language models (SLMs) exhibit rank deficiencies and under-utilise their available dimensionality. This raises a natural question: can ReLoRA's rank-expanding update rule steer SLMs toward healthier learning dynamics, mitigating rank bottlenecks in a capacity-constrained regime? We argue SLMs are an ideal testbed: they train quickly, enable controlled ablations, and make rank phenomena more measurable. We present the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Across loss, Paloma perplexity, and BLiMP, we find that ReLoRA underperforms full-rank training, with gaps widening at larger scales. Analysis of proportional effective rank and condition numbers shows that ReLoRA amplifies existing rank deficiencies and induces ill-conditioned updates early in training. Our results suggest that while ReLoRA's merge-and-restart strategy can expand ranks in larger models, it does not straightforwardly translate to capacity-limited SLMs, motivating adaptive-rank or hybrid-rank approaches for low-compute pretraining.
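The merge-and-restart idea can be sketched in a few lines. The toy example below is not the authors' training code: random integer matrices stand in for trained adapters, the helper names (matmul, matrix_rank, relora_rank_growth) are hypothetical, and details such as learning-rate schedules and optimiser state are left out. It only illustrates why the rank of the accumulated update can grow: each adapter product B A has rank at most r, so after k merges the total has rank at most min(d, k * r).

```python
import random
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matrix_rank(M):
    """Exact rank via Gaussian elimination over Fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rank, len(rows))
                      if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            if factor != 0:
                rows[i] = [a - factor * b
                           for a, b in zip(rows[i], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


def relora_rank_growth(d, r, n_cycles, seed=0):
    """Return the rank of the accumulated update after each merge."""
    rng = random.Random(seed)
    total = [[0] * d for _ in range(d)]  # accumulated change to the weight
    ranks = []
    for _ in range(n_cycles):
        # Fresh adapters each cycle (stand-ins for trained B and A).
        B = [[rng.randint(-3, 3) for _ in range(r)] for _ in range(d)]
        A = [[rng.randint(-3, 3) for _ in range(d)] for _ in range(r)]
        delta = matmul(B, A)  # rank at most r
        # Merge: fold the low-rank product into the full matrix.
        total = [[t + u for t, u in zip(tr, dr)]
                 for tr, dr in zip(total, delta)]
        ranks.append(matrix_rank(total))
    return ranks


if __name__ == "__main__":
    print(relora_rank_growth(d=8, r=2, n_cycles=4))
```

The bound min(d, k * r) is only an upper limit; whether trained adapters actually approach it in small models is the empirical question the study addresses.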

Paper Structure

This paper contains 49 sections, 6 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: LoRA decomposition (Hu et al., 2022), which can be applied to any linear operation parameterised by a matrix
  • Figure 2: Training trajectories of cross-entropy loss, Paloma perplexity and BLiMP score across the tiny and small models, plotted against GPU hours taken
  • Figure 3: Proportional effective rank (PER) values of the parameters of the OV Circuit and the SwiGLU $W_2$ matrix, averaged over the models' layers. Values are shown with the 95% confidence interval.
  • Figure 4: Proportional effective rank (PER) values of the gradient updates of the output and value projections in the attention mechanism and the SwiGLU $W_2$ matrix, averaged over the models' layers. Values are shown with bands representing the 95% confidence interval. The hollow circles represent NaN values.
  • Figure 5: Condition numbers of the weight matrices of the OV Circuit and the SwiGLU $W_2$ matrix, averaged over the models' layers. Values are shown with the 95% confidence interval.
  • ...and 3 more figures