Using Deep Learning to Find the Next Unicorn: A Practical Synthesis

Lele Cao, Vilhelm von Ehrenheim, Sebastian Krakowski, Xiaoxue Li, Alexandra Lutz

TL;DR

This work targets the challenge of predicting unicorn startups in early-stage VC by synthesizing deep learning approaches across a nine-task lifecycle, from problem framing to deployment. It advocates a practical pipeline that emphasizes multi-modal and extrinsic data, investor-centric data splits, data balance and imputation strategies, and portfolio-based evaluation to reflect real-world ROI. A strong emphasis is placed on explainability (global and instance-level) and keeping humans in the loop to enhance trust and adapt to changing investment criteria. The paper provides actionable guidance for practitioners, highlighting compatibility filtering before prediction, simple yet effective data preprocessing, and model-agnostic explanations, with implications for developing transparent, scalable, and ROI-informed DL systems for startup sourcing.

Abstract

Startups often represent newly established business models associated with disruptive innovation and high scalability. They are commonly regarded as powerful engines for economic and social development. Meanwhile, startups are heavily constrained by many factors such as limited financial funding and human resources. Therefore, the chance for a startup to eventually succeed is as rare as "spotting a unicorn in the wild". Venture Capital (VC) strives to identify and invest in unicorn startups during their early stages, hoping to gain a high return. To avoid entirely relying on human domain expertise and intuition, investors usually employ data-driven approaches to forecast the success probability of startups. Over the past two decades, the industry has gone through a paradigm shift moving from conventional statistical approaches towards becoming machine-learning (ML) based. Notably, the rapid growth of data volume and variety is quickly ushering in deep learning (DL), a subset of ML, as a potentially superior approach in terms of capacity and expressivity. In this work, we carry out a literature review and synthesis on DL-based approaches, covering the entire DL life cycle. The objective is a) to obtain a thorough and in-depth understanding of the methodologies for startup evaluation using DL, and b) to distil valuable and actionable learning for practitioners. To the best of our knowledge, our work is the first of this kind.

Paper Structure (17 sections, 17 figures)

Figures (17)

  • Figure 1: High-level overview of ML (machine learning) based startup sourcing. The ML model is trained to approximate a function $f(\cdot)$ so that the input data $\mathbf{x}$ describing a startup can be mapped to an output variable $y$ indicating the recommended investment propensity, which can be either discrete (good vs. bad) or continuous (success probability).
  • Figure 2: DL (deep learning) utilizes ANNs (artificial neural networks) with at least two hidden layers; thus (a) is not considered a DL model in this work. The input data $\mathbf{x}$ is fed into the input layer before flowing through the hidden layers. The output layer generates the final prediction $y$. The connections (fully or partly connected) between adjacent layers carry trainable weights.
  • Figure 3: Two ways of addressing the probability of startup success and compatibility. Most DL-based works do not explicitly consider startup success and compatibility at the same time. Two feasible solutions are presented here. We recommend solution (b) over (a) due to its simplicity, flexibility, and closer approximation to real use cases.
  • Figure 4: Summary of the criteria adopted to evaluate startup success. (a) shows the percentage of each success criterion, sorted by occurrence. (b) shows the percentage of works that combine different numbers of criteria.
  • Figure 5: Summary of the categories of input data used by the surveyed work. (a) shows the percentage of each data category (detailed in Section \ref{sec:walkthrough-data-category}), sorted by occurrence. (b) shows a snapshot (as of the time of writing) of the data modalities utilized: numerical, categorical, text, graph, time-series, image, video and audio.
  • ...and 12 more figures
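
The captions of Figures 1 and 2 describe a model $f(\cdot)$ that maps startup features $\mathbf{x}$ to an investment propensity $y$ through at least two hidden layers. The following is a minimal sketch of that forward pass only, not the authors' implementation. The function names, layer sizes, ReLU/sigmoid activations, the 0.5 threshold, and the example feature vector are all assumptions made for illustration, and the weights are random and untrained, so the output carries no predictive meaning.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer: small random weights, zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation):
    """Apply one layer: activation(W x + b), computed unit by unit."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def build_model(n_features, hidden_sizes=(8, 4), seed=0):
    """Build an untrained network; 'deep' here means >= 2 hidden layers."""
    if len(hidden_sizes) < 2:
        raise ValueError("a DL model needs at least two hidden layers")
    rng = random.Random(seed)
    sizes = [n_features, *hidden_sizes, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(model, x):
    """Continuous output y: a value in (0, 1) read as success probability."""
    h = list(x)
    for layer in model[:-1]:
        h = dense(h, layer, relu)          # hidden layers
    return dense(h, model[-1], sigmoid)[0]  # single-unit output layer


def recommend(model, x, threshold=0.5):
    """Discrete output y: 'good' vs. 'bad', by thresholding the probability."""
    return "good" if predict(model, x) >= threshold else "bad"


# Hypothetical, already-normalized feature vector for one startup.
model = build_model(n_features=3)
probability = predict(model, [0.2, 1.0, 0.0])
label = recommend(model, [0.2, 1.0, 0.0])
```

Both output forms mentioned in the Figure 1 caption appear here: `predict` returns the continuous value and `recommend` the discrete one. In practice the weights would be learned from labeled data rather than sampled at random.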