Neural Scaling Laws for Deep Regression

Tilen Cadez, Kyoung-Min Kim

TL;DR

This work investigates neural scaling laws for deep regression by predicting magnetic Hamiltonian parameters from simulated twisted bilayer CrI3 images. Across fully connected networks, ResNet-18, and Vision Transformer architectures, the mean-squared error exhibits power-law decay with both dataset size $N_D$ and model size $N_M$, with exponents up to around $2.3$ that depend on the regressed parameter and architecture. The study demonstrates that larger data and/or larger models can substantially improve accuracy, providing practical guidelines for resource allocation in scientific regression tasks and highlighting the need for a theoretical framework to explain the emergent power laws. It also notes architecture-dependent scaling behavior and the importance of an adjustable learning-rate schedule to reveal consistent scaling over wide data regimes.

Abstract

Neural scaling laws--power-law relationships between generalization errors and characteristics of deep learning models--are vital tools for developing reliable models while managing limited resources. Although the success of large language models highlights the importance of these laws, their application to deep regression models remains largely unexplored. Here, we empirically investigate neural scaling laws in deep regression using a parameter estimation model for twisted van der Waals magnets. We observe power-law relationships between the loss and both training dataset size and model capacity across a wide range of values, employing various architectures--including fully connected networks, residual networks, and vision transformers. Furthermore, the scaling exponents governing these relationships range from 1 to 2, with specific values depending on the regressed parameters and model details. The consistent scaling behaviors and their large scaling exponents suggest that the performance of deep regression models can improve substantially with increasing data size.
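As a rough worked illustration of what exponents in this range imply, assume the loss follows a pure power law in the training dataset size. The prefactor $c$ and the two dataset sizes $N_1$, $N_2$ are hypothetical symbols introduced here, not quantities from the paper:

```latex
\epsilon(N_D) = c\, N_D^{-\alpha_D}
\quad\Longrightarrow\quad
\frac{\epsilon(N_2)}{\epsilon(N_1)} = \left(\frac{N_2}{N_1}\right)^{-\alpha_D}
% Tenfold more data, N_2 = 10 N_1:
%   alpha_D = 1  ->  ratio = 10^{-1}  (loss drops by a factor of 10)
%   alpha_D = 2  ->  ratio = 10^{-2}  (loss drops by a factor of 100)
```

Under this assumption, moving from an exponent of 1 to an exponent of 2 turns a tenfold gain per decade of data into a hundredfold gain.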

Paper Structure

This paper contains 9 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Neural scaling laws in a parameter estimation model. Panels (a)–(c) show the geometric mean of the mean squared error (MSE) test loss in the prediction of three magnetic parameters ($\theta$, $J$, $D$), respectively, as a function of training dataset size. Each marker corresponds to a different architecture: fully connected networks (FCNs), residual network (ResNet-18), and vision transformer (ViT). The two FCNs, indicated by red and yellow squares, have different numbers of neurons per hidden layer ($n_n=4$ and $n_n=512$, respectively) with the same number of hidden layers ($n_l=3$). In each panel, dashed lines represent power-law fits of the form $\epsilon \sim N_D^{-\alpha_D}$, where $\epsilon$ is the loss, $N_D$ is the training dataset size, and $\alpha_D$ is the scaling exponent. All data are plotted on a log-log scale, with dataset sizes ranging from 224 to 114688.
  • Figure 2: Dataset and regression model used in the scaling analysis of generalization error. (a) Dataset: simulated magnetic domain images generated via atomistic spin simulations of twisted bilayer CrI3. Red (blue) indicates $+1$ ($-1$) of the local out-of-plane normalized magnetization. Only images of the top layer are shown. (b) Regression model: the parameter estimation model predicts the magnetic Hamiltonian parameters $(\theta, J, D)$ for twisted bilayer CrI3 from the input magnetic domain images.
  • Figure 3: Statistics of MSE test loss across different realizations of FCNs. (a)–(d) Distributions of test loss across multiple network realizations $N_r$ for the regression of $J$, with bars indicating the number of networks $N_{\mathrm{bin}}$ within each loss range. Each panel corresponds to a specific combination of the number of neurons per hidden layer $n_n$ and dataset size $N_D$, with a fixed number of hidden layers $n_l=3$: panels (a) and (c) show results for $n_n=4$ and $N_D = 3584$; panels (b) and (d) are for $n_n=16$ and $N_D = 28672$. Panels (a) and (b) display distributions over $N_r = 20$ network realizations, whereas panels (c) and (d) show results with $N_r = 100$ realizations. In each panel, black, blue, and red lines denote the arithmetic mean, geometric mean, and median of each distribution, respectively, with shaded regions representing the standard errors of the arithmetic and geometric means and the median absolute deviation, each depicted in their corresponding colors. (e) Evolution of the geometric mean with increasing number of network realizations. Results for four model sizes ($n_n=4, 16, 64, 256$) are shown in red, orange, green, and blue. For each size, data are presented across three training dataset sizes $N_D = 1792$, 7168, and 28672, marked by square, pentagon, and hexagon symbols, respectively. Error bars indicate the standard deviation of the geometric means across bootstrap samples. The bootstrap procedure involves generating 50 random subsets from the data of 100 networks, computing the geometric mean for each subset, and plotting the average of these means.
  • Figure 4: Geometric mean of MSE test loss as a function of training dataset size. Panels (a)–(c) show results for the regression of $\theta$, $J$, and $D$, respectively. Each marker represents FCNs with varying $n_n$ from 4 to 512 with a fixed value of $n_l=3$. Panel (d) presents the result for the regression of $J$, with each marker indicating different values of $n_l$ and a fixed value of $n_n=16$. In each panel, dashed lines are power-law fits of the form $\epsilon \sim N_D^{-\alpha_D}$. In panel (d), the fitted values of $\alpha_D$ are included in the legend. All data are plotted on a log-log scale, with dataset sizes ranging from 224 to 114688. The green square in panels (a)–(c) indicates the result reported in Ref. Lee_2024.
  • Figure 5: Power-law scaling exponents $\alpha_D$ as a function of FCN model size. Each marker displays the fitted value of $\alpha_D$ from the MSE test loss data in Fig. 4(a)–(c) for $\theta$, $J$, or $D$, respectively. The FCN model size $N_M$ ranges from approximately $8 \times 10^{4}$ to $10^{7}$ parameters, corresponding to FCNs with $n_n$ from 4 to 512 and $n_l=3$. Dashed lines represent logarithmic fits of the form $\alpha_D = a_D \log_{10}(N_M/10^6) + b_D$, where $a_D$ and $b_D$ are fitting parameters. The data points for the largest value of $N_M$ are omitted from the fit. Circle and diamond markers indicate the $\alpha_D$ values for ResNet-18 and ViT, respectively.
  • ...and 2 more figures
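
The captions above describe two recurring steps: averaging the MSE test loss over network realizations with a geometric mean, and fitting $\epsilon \sim N_D^{-\alpha_D}$ on a log-log scale. A minimal sketch of how such an analysis could be set up is given below. It is not the authors' code: the function names `geometric_mean` and `fit_power_law`, the least-squares fit via `numpy.polyfit`, and the synthetic losses (true exponent 1.5, log-normal scatter, 20 realizations) are all assumptions made for illustration.

```python
import numpy as np


def geometric_mean(losses):
    """Geometric mean of positive losses, computed in log space."""
    losses = np.asarray(losses, dtype=float)
    return float(np.exp(np.mean(np.log(losses))))


def fit_power_law(n_d, loss):
    """Fit loss ~ c * n_d**(-alpha) by a straight-line fit in log10-log10 space.

    Returns (alpha, c). This is an ordinary least-squares fit; the paper's
    exact fitting procedure is not specified here.
    """
    slope, intercept = np.polyfit(np.log10(n_d), np.log10(loss), deg=1)
    return float(-slope), float(10.0 ** intercept)


# Synthetic illustration only (NOT data from the paper).
rng = np.random.default_rng(0)
n_d = 224.0 * 2.0 ** np.arange(10)  # 224, 448, ..., 114688
true_alpha, true_c = 1.5, 50.0      # hypothetical ground truth
# 20 hypothetical network realizations per dataset size, with multiplicative noise.
noise = rng.lognormal(mean=0.0, sigma=0.3, size=(n_d.size, 20))
losses = true_c * n_d[:, None] ** (-true_alpha) * noise

# Geometric mean over realizations, then a power-law fit over dataset sizes.
gm = np.array([geometric_mean(row) for row in losses])
alpha, c = fit_power_law(n_d, gm)
print(f"fitted alpha_D = {alpha:.2f}, prefactor = {c:.1f}")
```

The geometric mean is the natural average here because the fit is done on logarithmic axes: it equals the exponential of the mean log-loss, so a few realizations with unusually large loss do not dominate the point being fitted.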