Neural Scaling Laws for Deep Regression
Tilen Cadez, Kyoung-Min Kim
TL;DR
This work investigates neural scaling laws for deep regression by predicting magnetic Hamiltonian parameters from simulated images of twisted bilayer CrI$_3$. Across fully connected networks, ResNet-18, and Vision Transformer architectures, the mean-squared error decays as a power law in both dataset size $N_D$ and model size $N_M$, with exponents up to around $2.3$ that depend on the regressed parameter and the architecture. The study shows that more data, larger models, or both can substantially improve accuracy, which gives practical guidance for allocating resources in scientific regression tasks and highlights the need for a theoretical framework that explains the emergent power laws. It also notes that scaling behavior is architecture-dependent and that an adjustable learning-rate schedule is important for revealing consistent scaling over wide data regimes.
Abstract
Neural scaling laws (power-law relationships between generalization errors and characteristics of deep learning models) are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, using various architectures, including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
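
As an illustration of what a scaling exponent means in practice, the sketch below estimates the exponent of a power law of the form loss = c * N^(-alpha) with a straight-line fit in log-log space. It is not the authors' fitting procedure. The function name `fit_power_law`, the dataset sizes, and the values `true_alpha = 1.5` and `true_c = 50.0` are assumptions chosen for the example, and the loss values are synthetic rather than results from the paper.

```python
import numpy as np


def fit_power_law(n, loss):
    """Fit loss ~ c * n**(-alpha) by least squares on log(loss) vs log(n).

    Returns (alpha, c). Hypothetical helper, not taken from the paper.
    """
    slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
    return -slope, np.exp(intercept)


# Synthetic, noise-free losses generated from an assumed power law.
# These numbers are illustrative only and are not measurements.
n_d = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4])  # training set sizes
true_alpha, true_c = 1.5, 50.0                       # assumed values
mse = true_c * n_d ** (-true_alpha)

alpha, c = fit_power_law(n_d, mse)

# Under a power law, doubling the data divides the loss by 2**alpha.
improvement_per_doubling = 2.0 ** alpha

print(f"alpha = {alpha:.3f}, c = {c:.3f}")
print(f"loss reduction factor per doubling of N_D = {improvement_per_doubling:.3f}")
```

Under this form, an exponent of 1 halves the loss each time the dataset doubles, and an exponent of 2 reduces it by a factor of 4, which is why exponents in the range reported here imply large gains from additional data. On real measurements the fit should be restricted to the range where the log-log curve is actually linear.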
