On the Role of Batch Size in Stochastic Conditional Gradient Methods

Rustem Islamov, Roman Machacek, Aurelien Lucchi, Antonio Silveti-Falls, Eduard Gorbunov, Volkan Cevher

Abstract

We study the role of batch size in stochastic conditional gradient methods under a $μ$-Kurdyka-Łojasiewicz ($μ$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
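To make the interaction between stepsize, batch size, and noise concrete, the following is a minimal sketch of a generic momentum-based stochastic conditional gradient loop. It is not the paper's algorithm: the toy objective $f(x) = \tfrac{1}{2}\|x\|^2$, the $\ell_\infty$-ball linear minimization oracle, the unconstrained update $x \leftarrow x + \text{stepsize} \cdot d$, the Gaussian noise model, and all function and parameter names (`lmo_linf`, `momentum_cg`, `steps_for_budget`, `stepsize`, `momentum`) are assumptions chosen for illustration. Only the initialization $m_0 = g(x_0;\xi_0)$ is taken from the statement of Theorem 4.1.

```python
import random


def lmo_linf(m, radius):
    """LMO over the l-infinity ball: a minimizer of <m, d> subject to ||d||_inf <= radius."""
    return [-radius * ((v > 0) - (v < 0)) for v in m]


def stochastic_grad(x, batch_size, noise_std, rng):
    """Noisy gradient of the toy objective 0.5 * ||x||^2 (assumed noise model:
    Gaussian with standard deviation noise_std / sqrt(batch_size))."""
    scale = noise_std / batch_size ** 0.5
    return [xi + rng.gauss(0.0, scale) for xi in x]


def steps_for_budget(token_budget, batch_size, seq_len):
    """Number of iterations affordable when each step consumes batch_size * seq_len tokens."""
    return token_budget // (batch_size * seq_len)


def momentum_cg(x0, steps, stepsize, momentum, batch_size,
                radius=1.0, noise_std=1.0, seed=0):
    """Generic momentum-based stochastic conditional gradient loop (illustrative only)."""
    rng = random.Random(seed)
    x = list(x0)
    # Initialize the momentum buffer with one stochastic gradient: m_0 = g(x_0; xi_0).
    m = stochastic_grad(x, batch_size, noise_std, rng)
    for _ in range(steps):
        d = lmo_linf(m, radius)                      # direction from the LMO
        x = [xi + stepsize * di for xi, di in zip(x, d)]
        g = stochastic_grad(x, batch_size, noise_std, rng)
        # Exponential moving average of stochastic gradients.
        m = [(1 - momentum) * mi + momentum * gi for mi, gi in zip(m, g)]
    return x
```

Under a fixed token budget, `steps_for_budget` shows the trade-off the abstract refers to: a larger batch reduces the noise in each gradient estimate but also reduces the number of iterations available, so the two effects have to be balanced through the choice of batch size and stepsize.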

Paper Structure

This paper contains 42 sections, 6 theorems, 71 equations, 10 figures, 8 tables, 2 algorithms.

Key Result

Theorem 4.1

Let the smoothness, norm-equivalence, $\mu$-KL, and bounded-variance assumptions (eq:smoothness, eq:norm_equiv, eq:mu_kl, and eq:bounded_variance) hold. Let $m_{0} = g(x_{0};\xi_{0})$. Let the parameters of the algorithm (alg:spectral_gd_decay_fw) and the initialization $x_0$ be chosen as follows [...], where $\mathcal{O}$ hides all numerical constants and $\tilde{\mathcal{O}}$ hides all numerical and logarithmic factors. Then, the output of the algorithm (alg:spectral_gd_decay_fw) after $K$ iterations s...

Figures (10)

  • Figure 1: Empirical verification of the $\mu$-KL assumption during the training of a 124M NanoGPT model. The points with a loss below 5 are well fit by a linear function whose slope equals $\mu$.
  • Figure 2: Empirical gradient variance and fitted power-law models as functions of batch size $B$ with fixed sequence length $S=1024$ (left) and of sequence length $S$ with fixed batch size $B=512$ (right) when training a 124M NanoGPT model on the FineWeb dataset under a fixed token budget $T=2.7$B. For the left plot, the estimated scaling exponent is $\lambda \approx 0.9$ and $B_{\rm shift}\approx 90$, while for the right plot they are $\lambda \approx 1.1$ and $S_{\rm shift} \approx 35$. The fitted models support the validity of the bounded-variance assumption.
  • Figure 3: Comparison of batch size and sequence length scheduling strategies when training a 1B model. The restarting schemes (in yellow and gray) are compared against fixed schedules. The validation loss is evaluated with a smaller sequence length of $1024$. The values of batch sizes $B_{0,1,2}$, sequence lengths $S_{0,1}$, and Frank--Wolfe stepsizes $\beta_{0,1}$ are given in the legends. The notation $(B_{0,1,2}, S_{0,1,2}, \beta_{0,1})$ indicates which batch size, sequence length, and Frank--Wolfe stepsize are used in a given setup. The notation $(B_0,S_0,\beta_0)\to(B_{1,2}, S_{1,2},\beta_1)$ indicates how the parameters of Scion change after a restart (e.g., the batch size increases from $B_0$ to $B_{1,2}$). The label $\mu$P or BST indicates the rule used to select $B$, $S$, and $\beta$.
  • Figure 4: Comparison of fixed large batch size strategies when training a 1B model. The validation loss is evaluated with a smaller sequence length of $1024$. Scion with a batch size of $1024$, as suggested by our BST scaling rule, outperforms the baselines with batch sizes $2048$ and $4096$. The values of batch sizes $B_{1,2,3}$, sequence lengths $S$, and Frank--Wolfe stepsizes $\beta_{0,1}$ are given in the legends. The notation $(B_{1,2,3}, S, \beta_{0,1})$ indicates which batch size, sequence length, and Frank--Wolfe stepsize are used in a given setup. The label BST indicates the rule used to select $B$, $S$, and $\beta$.
  • Figure 5: The final performance of the 124M model when varying the Frank--Wolfe stepsize $\beta$ under different token budgets (left: 2.7B, center: 5.3B, right: 8.0B). We average the train loss over 3 random seeds and report the moving average over a window of size 500. We observe that the BST scaling rule gives a good estimate of the optimal $\beta$ as the token budget increases. Moreover, the difference in performance between the BST and $\mu$P baselines grows with the token budget.
  • ...and 5 more figures

Theorems & Definitions (15)

  • Theorem 4.1
  • Remark 4.1
  • Corollary 4.1: BST Scaling Rule
  • Remark 5.1
  • Remark 6.1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem D.1: Full statement of \ref{['thm:str_decay_mu_kl_expectation_no_restarts']}
  • ...and 5 more