Interpretable Machine Learning for Predicting Startup Funding, Patenting, and Exits

Saeid Mashhadi, Amirhossein Saghezchi, Vesal Ghassemzadeh Kashani

TL;DR

This study tackles the challenge of forecasting startup outcomes by building an interpretable, leakage-safe ML pipeline that integrates financing histories and patent stocks from Crunchbase and USPTO data. The authors construct a non-overlapping firm-quarter panel spanning 2010–2023, train exclusively on a development window (2010–2019), and evaluate on out-of-time holdout (2020–2021) and final (2022–2023) cohorts for three horizons: funding within $12$ months, patent-stock growth within $24$ months, and exits within $36$ months. They compare linear and tree-based models under inverse-prevalence weighting and SMOTE-NC, selecting winners by PR-AUC (with AUROC as a tiebreaker) and providing SHAP/importance-based interpretability, calibration checks, and out-of-time scored target lists. Key findings show strong predictability of patent growth, substantial but lower predictability for near-term funding, and meaningful yet modest gains for exit forecasting, with interpretable drivers such as financing recency, firm age, investment depth, and IP stocks aligning with economic priors. The framework offers actionable, ranked screening outputs for investors and policymakers while maintaining rigorous leakage controls and transparent explanations, indicating practical value for innovation finance research and decision-making.

Abstract

This study develops an interpretable machine learning framework to forecast startup outcomes, including funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Three horizons are evaluated: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles, including Random Forest, XGBoost, LightGBM, and CatBoost, are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, providing transparent and reproducible rankings for innovation finance.
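The windowing and weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The function names, the `year` field, and the normalization `n / (k * n_c)` are assumptions made here for illustration; the abstract states only that inverse-prevalence weights are used and that preprocessing is fit on the 2010-2019 development window. The 2020-2021 and 2022-2023 boundaries follow the TL;DR.

```python
from collections import Counter


def split_by_window(rows, dev_end=2019, holdout_end=2021):
    """Assign firm-quarter rows to development, holdout, or final windows.

    `rows` is a list of dicts with a hypothetical integer "year" key.
    Only the development rows should be used to fit preprocessing or
    to compute class weights; later windows are scored as-is.
    """
    windows = {"development": [], "holdout": [], "final": []}
    for row in rows:
        if row["year"] <= dev_end:
            windows["development"].append(row)
        elif row["year"] <= holdout_end:
            windows["holdout"].append(row)
        else:
            windows["final"].append(row)
    return windows


def inverse_prevalence_weights(labels):
    """Return one weight per class, proportional to 1 / class frequency.

    Assumed normalization: n / (k * n_c), where n is the sample count,
    k the number of classes, and n_c the count of class c. With this
    choice every class contributes the same total weight.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Toy labels: 9 negatives and 1 positive from the development window.
    dev_labels = [0] * 9 + [1]
    weights = inverse_prevalence_weights(dev_labels)
    # Per-row sample weights that a classifier could consume.
    sample_weights = [weights[y] for y in dev_labels]
    print(weights, sum(sample_weights))
```

Computing the weights from development labels only keeps the later cohorts out of every fitted quantity, which matches the leakage control the abstract describes.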

Paper Structure

This paper contains 27 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: End-to-end data processing and modeling pipeline. Steps 1–5 summarize the sequential workflow from data preparation to out-of-sample prediction. The lower ribbon highlights leakage controls and reproducibility safeguards, including development-only preprocessing, resampling confined to training data, and deterministic artifacts for replication.
  • Figure 2: Funding (12m), LightGBM trained with inverse-prevalence weights. Bars show global feature importance measured by mean absolute SHAP values ($\text{mean}(|\text{SHAP}|)$) on the final evaluation window. Financing recency and firm age dominate, followed by cumulative capital raised and investor breadth.
  • Figure 3: Funding (12m), LightGBM trained with inverse-prevalence weights. Quantile-binned calibration curve on the final window. The model shows optimism at higher predicted probabilities relative to the $45^\circ$ reference line, suggesting that isotonic recalibration could improve probability calibration without affecting ranking.
  • Figure 4: Funding (12m), LightGBM trained with inverse-prevalence weights. Partial dependence plots on the final window for the top six predictors. The probability of next-round funding decreases with time since the last round and firm age, and rises with cumulative funding, investor breadth, and round count. Higher patent stock shows a mild negative effect. Note: Uncalibrated probabilities appear overconfident at high scores, but ranking remains unaffected.
  • Figure 5: Patent growth (24m), Random Forest trained with inverse-prevalence weights. Bars show global feature importance based on mean decrease in impurity (MDI, Gini). Firm age, time since last round, and cumulative capital raised dominate the ranking, followed by patent and citation stocks. Note: MDI can overemphasize high-cardinality or correlated predictors.
  • ...and 5 more figures
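
The quantile-binned calibration curve mentioned in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the function name, the default of 10 bins, and the equal-count binning by sorted rank are choices made here. Each bin yields a pair of mean predicted probability and observed positive rate; points falling below the $45^\circ$ line at high scores would correspond to the optimism the caption describes.

```python
def quantile_calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed positive rate) per bin.

    Bins are formed by sorting predictions and cutting them into
    `n_bins` groups of (nearly) equal size, so each point on the curve
    is backed by a similar number of observations.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")

    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    curve = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        if hi == lo:
            # Fewer observations than bins: skip empty bins.
            continue
        chunk = pairs[lo:hi]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        curve.append((mean_pred, observed))
    return curve


if __name__ == "__main__":
    # Toy scores and outcomes, invented purely for illustration.
    probs = [0.05, 0.10, 0.20, 0.30, 0.60, 0.70, 0.90, 0.95]
    labels = [0, 0, 0, 1, 0, 1, 1, 0]
    for mean_pred, observed in quantile_calibration_curve(labels, probs, n_bins=4):
        print(f"predicted={mean_pred:.3f} observed={observed:.3f}")
```

Because the bins depend only on the rank order of the scores, a monotone recalibration such as isotonic regression would move the points toward the diagonal without changing the ranking, apart from possible ties.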