Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families

Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Yuekai Sun, Mikhail Yurochkin

TL;DR

This work introduces Sloth, a family-aware, latent-skills scaling framework that predicts LLM benchmark performance by modeling a small set of interpretable latent skills that govern compute-to-performance mappings. By sharing information across benchmarks and model families through a factor-analytic-like structure and a translog-like skill evolution, Sloth achieves accurate predictions with fewer parameters and yields actionable insights into how reasoning, knowledge, and instruction-following scale with compute. The approach is validated on 12 prominent benchmarks from Open LLM Leaderboard datasets, demonstrates interpretable loadings for latent skills, and extends to downstream-task prediction and compute-aware scaling. The results offer practical utilities for predicting larger-model performance, guiding resource allocation, and forecasting behavior under scaled inference compute. Overall, Sloth provides a scalable, interpretable framework to understand and forecast LLM capabilities across diverse benchmarks and families.

Abstract

Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens, but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks, from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance accurately and offers insights into scaling behaviors for complex downstream tasks, increased test-time compute, and compute-optimal scaling of skills.
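
The latent-skill idea in the abstract can be made concrete with a small sketch. This is an illustration under assumptions, not the paper's exact specification: the logistic link, the feature map `translog_inputs` (log size, log tokens, and their product), the parameter shapes, and every numeric value below are hypothetical and chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic link mapping real-valued scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def translog_inputs(size, tokens):
    """Assumed feature map: log size, log tokens, and their interaction."""
    log_s, log_t = np.log(size), np.log(tokens)
    return np.array([log_s, log_t, log_s * log_t])

def predict_scores(size, tokens, alpha_family, B, Lambda, b):
    """Map compute inputs to latent skills, then skills to benchmark scores."""
    x = translog_inputs(size, tokens)      # shape (p,)
    skills = alpha_family + B.T @ x        # shape (d,), with a family-specific intercept
    return sigmoid(Lambda @ skills + b)    # shape (J,), one score per benchmark

# Toy, made-up parameters: p = 3 inputs, d = 2 skills, J = 3 benchmarks.
B = np.array([[0.05, 0.02], [0.03, 0.04], [0.001, 0.0005]])
Lambda = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
b = np.array([0.1, -0.2, 0.0])
alpha_family = np.array([-3.0, -2.0])

small = predict_scores(7e9, 2e12, alpha_family, B, Lambda, b)
large = predict_scores(70e9, 2e12, alpha_family, B, Lambda, b)
print(small, large)
```

In this sketch only `alpha_family` differs between families, while `B`, `Lambda`, and `b` are shared, which is one way to read the claim that skills scale with compute "with varying efficiencies across model families" without fitting a separate law per family.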

Paper Structure

This paper contains 43 sections, 1 theorem, 26 equations, 40 figures, 2 tables.

Key Result

Theorem A.2

Given that the true set of model parameters is $(\Lambda, b, B)$, if there is another set of parameters $(\tilde{\Lambda}, \tilde{b}, \tilde{B})$ that satisfy then, under the Assumption assump:ident, we have $\tilde{b}=b$, $\tilde{\Lambda} = \Lambda M$, and $\tilde{B} = B (M^\top)^{-1}$ for an invertible matrix $M\in {\mathbf{R}}^{d\times d}$. In particular, $M$ is orthogonal if $\Psi=I_d$, i.e
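
The transformation in the theorem can be checked numerically. The sketch below verifies only one algebraic consequence of the stated relations: replacing $(\Lambda, B)$ with $(\Lambda M, B (M^\top)^{-1})$ leaves the product $\Lambda B^\top$ unchanged, and for an orthogonal $M$ both matrices are rotated by the same $M$. The shapes (benchmarks by skills for $\Lambda$, inputs by skills for $B$), the toy sizes, and the specific matrices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
J, p, d = 5, 3, 2                        # benchmarks, inputs, skills (toy sizes)
Lambda = rng.normal(size=(J, d))         # loadings, assumed shape (J, d)
B = rng.normal(size=(p, d))              # input-to-skill coefficients, assumed shape (p, d)

# Any invertible d x d matrix works; this one has determinant 2.
M = np.array([[2.0, 1.0], [0.0, 1.0]])
Lambda_t = Lambda @ M                    # Lambda-tilde = Lambda M
B_t = B @ np.linalg.inv(M.T)             # B-tilde = B (M^T)^{-1}

# (Lambda M)(B (M^T)^{-1})^T = Lambda M M^{-1} B^T = Lambda B^T
same_product = np.allclose(Lambda_t @ B_t.T, Lambda @ B.T)

# Orthogonal case: (Q^T)^{-1} = Q, so Lambda and B are rotated by the same Q.
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
same_rotation = np.allclose(np.linalg.inv(Q.T), Q)

print(same_product, same_rotation)
```

This is why the loadings are identified only up to $M$: any quantity that depends on the parameters through $\Lambda B^\top$ cannot distinguish the two parameter sets.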

Figures (40)

  • Figure 1: Mean absolute error (MAE), computed within each LLM family and then averaged across families, for different methods. Sloth performs competitively, with errors similar to or lower than the "Size and Tokens" variant, indicating an effective inductive bias.
  • Figure 2: Skills needed for each benchmark. We report the estimated loadings $\Lambda$ and name each latent skill based on the loading values.
  • Figure 3: Running Sloth with a shared intercept offers a way to model scaling laws that are family-independent.
  • Figure 4: We plot the skill levels (output), minus the family-specific intercept terms, against the inputs on the x- and y-axes. These plots show how each input affects the production of skills differently. For example, "Reasoning" appears to be more affected by model size than by tokens when compared to other skills. Moreover, "Knowledge" is more influenced by inputs in general (its level curves are steeper), while the other skills should be more sensitive to other family-dependent factors.
  • Figure 5: Gains from instruction tuning (IT) for different families on three latent skills. We compare the skills of base models (x-axis) and instruction-tuned models (y-axis); a model on the 45-degree line has the same skill level in its base and instruct versions. IT has a large and positive impact on "Instruction Following" and produces much larger variation in this skill than the inputs shown in Figure 4. Moreover, IT has a moderate and negative effect on "Reasoning" and mixed effects on "Knowledge".
  • ...and 35 more figures
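
The error metric described for Figure 1 (MAE within each family, then averaged across families) can be sketched as follows. The function name `family_averaged_mae`, the record layout, and the scores are hypothetical and invented for illustration; they are not results from the paper.

```python
from collections import defaultdict

def family_averaged_mae(records):
    """Compute MAE within each family, then average those MAEs across families.

    records: iterable of (family, y_true, y_pred) tuples.
    Returns (average_over_families, per_family_mae).
    """
    errors = defaultdict(list)
    for family, y_true, y_pred in records:
        errors[family].append(abs(y_true - y_pred))
    per_family = {fam: sum(errs) / len(errs) for fam, errs in errors.items()}
    return sum(per_family.values()) / len(per_family), per_family

# Made-up scores: family "A" has two test models, family "B" has one.
records = [
    ("A", 0.5, 0.6),
    ("A", 0.7, 0.6),
    ("B", 0.4, 0.7),
]
avg_mae, per_family = family_averaged_mae(records)
pooled_mae = sum(abs(t - p) for _, t, p in records) / len(records)
print(avg_mae, pooled_mae)
```

Averaging per family gives each family equal weight, so the result (about 0.2 here) differs from the pooled MAE over all models (about 0.167), which would favor families with more test models.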

Theorems & Definitions (2)

  • Theorem A.2
  • proof