Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

Sangyoon Lee, Jaeho Lee

TL;DR

This paper addresses the contradictory gains reported for LoRA variants by identifying batch size as a major confound in evaluations. Using a unified experimental framework that varies batch size, learning rate, and protocol, the authors show that vanilla LoRA can match or beat PiSSA and MiLoRA when batch size is optimized, which reconciles prior claims. They analyze how the optimal batch size interacts with LoRA rank, dataset scale, and base-model capacity, and propose a low-cost proxy that uses small-scale, low-rank configurations on the full dataset to identify transferable batch-size settings. The work offers practical guidance for robust evaluation and efficient deployment of LoRA-based fine-tuning in resource-constrained environments. It highlights that the optimal batch size is not universally small and that careful tuning is essential for credible comparisons.

Abstract

Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical gains, often on the same benchmarks. We show that these contradictions arise from a single overlooked factor: the batch size. When properly tuned, vanilla LoRA often matches the performance of more complex variants. We further propose a proxy-based, cost-efficient strategy for batch size tuning, revealing the impact of rank, dataset size, and model capacity on the optimal batch size. Our findings elevate batch size from a minor implementation detail to a first-order design parameter, reconciling prior inconsistencies and enabling more reliable evaluations of LoRA variants.

Paper Structure (21 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Impact of batch size across LoRA variants. We observe that batch size selection alone can lead to a performance gap of over 10% in accuracy. Notably, when evaluated at its optimal batch size, vanilla LoRA beats both PiSSA and MiLoRA on the math reasoning task.
  • Figure 2: Effect of batch size across key determinants. We examine the interaction between batch size and three factors: (a) LoRA rank: the impact of batch size remains consistent across varying ranks $r$; (b) dataset scale: larger data regimes effectively leverage larger batch training; and (c) base model capacity: batch size effects are largely invariant to model scale. For a unified comparison, accuracies are normalized by shifting the maximum value of each setup to match the default configuration ($r=128$, 100K samples, 7B model). Original accuracy values are provided in Figure \ref{fig:orr_acc} for reference.
  • Figure 3: Batch size effect under fixed training steps. Under a fixed number of optimization steps, larger batch sizes lead to superior performance due to increased data throughput. This trend persists until a critical threshold is reached, beyond which increasing the batch size no longer improves test accuracy.
  • Figure 4: Impact of the warm-up phase and LR scheduling. We show that removing the warm-up phase maintains robust performance across all batch sizes without significant accuracy loss, whereas LR scheduling remains critical to model performance.
  • Figure 5: Interaction between optimal learning rate and batch size. We demonstrate that the optimal learning rate follows a non-monotonic trajectory as batch size increases, initially scaling upward before declining beyond a critical threshold.
  • ...and 1 more figures
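
The normalization mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes the "shift" is additive: a constant offset is added to each setup's curve so that its maximum equals the maximum of the default configuration. The function name `shift_to_reference` and all numbers are hypothetical and are not taken from the paper.

```python
def shift_to_reference(accs, reference_accs):
    """Additively shift a batch-size -> accuracy curve so that its maximum
    equals the maximum of the reference (default-configuration) curve.

    Assumption: the caption's "shifting the maximum value" is read here as
    adding one constant offset to every point of the curve.
    """
    offset = max(reference_accs.values()) - max(accs.values())
    return {batch_size: acc + offset for batch_size, acc in accs.items()}


# Made-up accuracies for illustration only; these are NOT results from the paper.
default_curve = {16: 50.0, 64: 54.0, 256: 52.0}  # stands in for r=128, 100K samples, 7B
other_curve = {16: 41.0, 64: 44.0, 256: 45.0}    # stands in for some other setup

shifted = shift_to_reference(other_curve, default_curve)
for batch_size, acc in shifted.items():
    print(batch_size, acc)
```

Under this reading, the shape of each curve and the location of its best batch size are unchanged. Only the vertical position moves, which is why the original values are still needed to compare absolute accuracy.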