How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines

Ayan Sengupta, Yash Goel, Tanmoy Chakraborty

TL;DR

The paper surveys neural scaling laws, examining power-law relationships among model size, data, and compute across language, vision, multimodal, and RL domains, while highlighting deviations in sparse, MoE, and retrieval-augmented settings. It offers a taxonomy and eight research questions that connect theory to practice, detailing guidelines on data composition (D-CPT), test-time inference, and architectural choices like PEFT and MoEs. The analysis reveals that traditional scaling laws are not universally applicable, especially under real-world constraints and advanced architectures, underscoring the need for adaptive, data-efficient, and inference-aware strategies. The authors advocate for practical benchmarks and sustainable AI practices, arguing that downscaling and multi-objective optimization can achieve competitive performance with lower resource costs and broader accessibility.

Abstract

Neural scaling laws have revolutionized the design and optimization of large-scale AI models by revealing predictable relationships between model size, dataset volume, and computational resources. Early research established power-law relationships in model performance, leading to compute-optimal scaling strategies. However, recent studies highlighted their limitations across architectures, modalities, and deployment contexts. Sparse models, mixture-of-experts, retrieval-augmented learning, and multimodal models often deviate from traditional scaling patterns. Moreover, scaling behaviors vary across domains such as vision, reinforcement learning, and fine-tuning, underscoring the need for more nuanced approaches. In this survey, we synthesize insights from over 50 studies, examining the theoretical foundations, empirical findings, and practical implications of scaling laws. We also explore key challenges, including data efficiency, inference scaling, and architecture-specific constraints, advocating for adaptive scaling strategies tailored to real-world applications. We suggest that while scaling laws provide a useful guide, they do not always generalize across all architectures and training strategies.
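As a rough illustration of the power-law relationships mentioned in the abstract, the sketch below fits a curve of the form L(N) = a * N^(-alpha) by linear regression in log-log space. The functional form (with no irreducible-loss term), the helper name `fit_power_law`, and the synthetic data points are all assumptions made for illustration; none of them are taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by ordinary least squares on (log x, log y).

    Hypothetical helper for illustration; returns the pair (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line log y = intercept + slope * log x.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic (made-up) losses generated from a known power law, so the fit
# can be checked against the parameters used to create the data.
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in model_sizes]

a, alpha = fit_power_law(model_sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On real training runs the points would be noisy, and a fit like this only describes the range of model sizes actually measured; as the abstract notes, such curves do not always generalize across architectures and training strategies.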

Paper Structure

This paper contains 29 sections, 20 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Papers surveyed under different categories. The detailed paper list is provided in Table \ref{tab:database1} of Appendix \ref{appx:details}.
  • Figure 2: A taxonomy of neural scaling laws.
  • Figure 3: Number of papers studied in this survey for different model architectures (a), scaling variables (b), and scaling tasks (c). The detailed paper list is provided in Table \ref{tab:database1} of Appendix \ref{appx:details}.
  • Figure 4: Practical roadmap summarizing training and inference strategies grounded in our eight research questions and taxonomy branches. (a) Training scaling strategies can be utilized for pre-training or fine-tuning unimodal and multimodal foundational and domain-adapted models. (b) Post-training inference strategies can be followed to ensure that the model is utilized efficiently for the downstream applications.