Predicting Company Growth using Scaling Theory informed Machine Learning

Ruyi Tao, Veronica R. Cappelli, Kaiwei Liu, Marcus J. Hamilton, Christopher P. Kempes, Geoffrey B. West, Jiang Zhang

TL;DR

This work introduces STIML, a hybrid framework that forecasts company growth by decomposing dynamics into a mechanistic trend based on a generalized scaling growth model and learnable fluctuations captured by data-driven models. The GM predictor for each financial indicator uses power-law scaling with assets, parameterized by $x=c_xA^{\beta_x}$, and combined via an Euler-based solution to obtain $x^{GM}$; STIML then models residuals $\mathbf{Y}-\mathbf{X}^{GM}$ with encoders/decoders such as GM-MLP or GM-iTransformer. Across 31,553 firms (1950–2019) with 16 indicators, STIML achieves higher predictive accuracy than both GM and purely data-driven baselines, with larger gains for big, stable firms and high-volatility regimes, and exhibits interpretability through latent representations and SHAP-based feature attributions. The results suggest that macroeconomic factors provide limited predictive value on average at the firm level, while asymmetries in deviations from scaling laws reveal learnable structure, pointing to directions for refining mechanistic models and incorporating asymmetric fluctuations. Overall, STIML demonstrates regime-dependent predictability in company growth and offers a principled framework to combine mechanistic insight with flexible learning for complex economic time series.
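The GM predictor described above can be illustrated with a minimal sketch. It assumes the scaling parameters $c_x$ and $\beta_x$ are estimated by ordinary least squares in log-log space and that the asset trajectory comes from a forward-Euler integration of some growth law $dA/dt = f(A)$. The function names, the synthetic data, and the proportional growth law `0.05 * a` are all hypothetical placeholders; the source does not state the form of the paper's growth equation, so this is not the authors' implementation.

```python
import numpy as np

def fit_power_law(assets, indicator):
    """Estimate c_x and beta_x in x = c_x * A**beta_x by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(assets), np.log(indicator), 1)
    return float(np.exp(intercept)), float(slope)

def euler_asset_path(a0, growth_rate, steps, dt=1.0):
    """Forward-Euler integration of dA/dt = growth_rate(A); returns A at steps 1..steps."""
    path, a = [], float(a0)
    for _ in range(steps):
        a = a + dt * growth_rate(a)
        path.append(a)
    return np.array(path)

def gm_indicator_path(asset_path, c_x, beta_x):
    """Map an asset trajectory to an indicator trajectory via x = c_x * A**beta_x."""
    return c_x * np.asarray(asset_path) ** beta_x

# Synthetic cross-section that follows an exact power law (illustrative values only).
assets = np.array([1e6, 1e7, 1e8, 1e9])
revenue = 2.0 * assets ** 0.8
c_x, beta_x = fit_power_law(assets, revenue)

# Hypothetical growth law: 5% proportional growth per step (an assumption, not the paper's model).
asset_path = euler_asset_path(100.0, lambda a: 0.05 * a, steps=3)
x_gm = gm_indicator_path(asset_path, c_x, beta_x)
```

On real data the log-log fit would be noisy, and the growth law passed to `euler_asset_path` would need to be replaced by the paper's generalized scaling growth equation.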

Abstract

Predicting company growth is a critical yet challenging task because observed dynamics blend an underlying structural growth trend with volatile fluctuations. Here, we propose a Scaling-Theory-Informed Machine Learning (STIML) framework that integrates a scaling-based growth model, which captures the mechanism-driven average trend, with a data-driven forecasting model that learns the residual fluctuations. Using Compustat annual financial statement data (1950–2019) for 31,553 North American companies, we extend the growth model beyond assets to multiple financial indicators and evaluate STIML against growth-model-only and purely data-driven baselines. Across 16 target variables, we show that company growth exhibits a clear separation between trend-driven predictability and fluctuation-driven predictability, with their relative importance depending strongly on company size and volatility. Interpretability analyses further show that STIML captures multivariate dependencies beyond simple autocorrelation, and that macroeconomic variables contribute significantly less to predictive performance on average. Moreover, we find that the scaling-based growth model overlooks asymmetric deviations, which contain structured, learnable signals, suggesting a path to refine mechanistic models.
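The trend-plus-fluctuation decomposition can be sketched as follows: take the growth-model prediction as a baseline, fit a data-driven model to the residual $\mathbf{Y}-\mathbf{X}^{GM}$, and add the predicted residual back. In this sketch a plain least-squares linear model stands in for the neural encoders and decoders (such as GM-MLP or GM-iTransformer), and all data are synthetic; both choices are assumptions made for illustration, and the fit is evaluated in-sample only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 3

# Synthetic inputs: lagged features and a stand-in growth-model baseline.
history = rng.normal(size=(n_samples, n_features))
x_gm = rng.uniform(5.0, 10.0, size=n_samples)

# Synthetic target: baseline trend plus a learnable fluctuation and small noise.
true_weights = np.array([0.5, -0.2, 0.1])
y = x_gm + history @ true_weights + 0.01 * rng.normal(size=n_samples)

# Step 1: the residual that the data-driven component must explain.
residual_target = y - x_gm

# Step 2: fit the stand-in residual model (linear least squares).
weights, *_ = np.linalg.lstsq(history, residual_target, rcond=None)

# Step 3: final prediction = mechanistic trend + learned fluctuation.
y_hat = x_gm + history @ weights

mae_gm = float(np.mean(np.abs(y - x_gm)))
mae_hybrid = float(np.mean(np.abs(y - y_hat)))
```

Because the synthetic fluctuation is linear in the features by construction, `mae_hybrid` falls well below `mae_gm` here; this demonstrates only the mechanics of the decomposition, not the paper's reported gains.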

Paper Structure

This paper contains 21 sections, 5 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Performance of the extended GM and the STIML framework. (a) The cumulative distribution of MAE for the extended GM and the constant model. (b) The framework of STIML. The model consists of two components: the mechanistic model captures the average trend of company growth, while time-series prediction techniques capture the fluctuations. By combining both components, the final prediction of future growth is obtained.
  • Figure 2: Performance comparison of STIML and baseline models. (a) Bar chart of the average MAE across 16 predicted variables for 5-step-ahead forecasting. The error bars indicate the standard deviation across repeated experimental runs. The legend reports the overall average MAE across all variables for each model. (b) The variation of prediction error with the prediction step for four core financial indicators: Assets (AT), Liabilities (LT), Revenues (REVT), and Cost of Goods (COGS). The shaded areas indicate the standard deviation across repeated experimental runs. The corresponding step-wise error curves for all 16 target variables are provided in \ref{fig:fin_maes_si}.
  • Figure 3: Comparing model performance across different grouping strategies. (a) Comparison of the MAE across different size groups of companies for different models. Each group corresponds to a range of average company assets (in USD), with categories defined as follows: micro $\in (0,10^6]$, small $\in [10^6, 10^8]$, mid $\in [10^8, 10^9]$, and large $\in [10^9,\infty)$. Different subplots represent different size groups. Each subplot displays the average MAE for three models (GM, ML, and GM-ML) on four core variables. The bars are color-coded as follows: blue for GM, yellow for ML, and red for GM-ML. Panels (b) and (c) illustrate how the difference in average MAE between models varies across different subgroups. The blue dashed line represents the average MAE of GM minus that of GM-ML, while the yellow solid line represents the average MAE of ML minus that of GM-ML. The shaded areas denote the standard deviation range across multiple experimental runs. Panel (b) groups companies by size, while panel (c) groups them by standard deviation. $\sigma(s)$ is the standard deviation of size and $\sigma(r)$ is the standard deviation of growth rate. The gray dots in the background of panel (c) are the original company data.
  • Figure 4: Error asymmetry analysis. (a) MAE distributions by error sign (over- vs. underestimation) for the four core financial indicators under GM (first row), ML (second row), and GM-ML (third row). Green denotes the underestimation MAEs, and orange denotes the overestimation MAEs. (b) Asymmetric performance gains, with gain curves decomposed into over- and underestimation components.
  • Figure 5: Visualization of feature representations learned by the GM-iTransformer model. PCA visualization of neural-network hidden-layer feature representations, colored by firm size (left), age (middle), and sector (right).
  • ...and 8 more figures
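
Two of the evaluation steps described in the captions of Figures 3 and 4 lend themselves to a short sketch: assigning companies to size groups by average assets, and splitting MAE by error sign. The helper names are hypothetical. Two conventions are assumptions: shared boundary values (for example exactly $10^6$) are assigned to the lower group, and overestimation means the prediction exceeds the true value, with exact hits excluded from both groups.

```python
import numpy as np

SIZE_LABELS = ["micro", "small", "mid", "large"]
SIZE_EDGES = [1e6, 1e8, 1e9]  # average assets in USD, from the Figure 3 caption

def size_group(avg_assets):
    """Label each company by size; upper edges are inclusive (assumed convention)."""
    values = np.atleast_1d(np.asarray(avg_assets, dtype=float))
    idx = np.digitize(values, SIZE_EDGES, right=True)
    return [SIZE_LABELS[i] for i in idx]

def mae_by_sign(y_true, y_pred):
    """Return MAE separately for overestimates (pred > true) and underestimates (pred < true)."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    over, under = err[err > 0], err[err < 0]
    return {
        "over": float(np.mean(np.abs(over))) if over.size else float("nan"),
        "under": float(np.mean(np.abs(under))) if under.size else float("nan"),
    }

# Illustrative values only.
groups = size_group([5e5, 1e6, 5e6, 1e9, 2e9])
split = mae_by_sign([10, 10, 10, 10], [12, 14, 7, 10])
```

Comparing the two entries of `split` across models is one simple way to check whether a predictor's errors are asymmetric in the sense the Figure 4 caption describes.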