Using Deep Learning to Find the Next Unicorn: A Practical Synthesis

Lele Cao; Vilhelm von Ehrenheim; Sebastian Krakowski; Xiaoxue Li; Alexandra Lutz


TL;DR

This work targets the challenge of predicting unicorn startups in early-stage VC by synthesizing deep learning approaches across a nine-task lifecycle, from problem framing to deployment. It advocates a practical pipeline that emphasizes multi-modal and extrinsic data, investor-centric data splits, data balance and imputation strategies, and portfolio-based evaluation to reflect real-world ROI. A strong emphasis is placed on explainability (global and instance-level) and keeping humans in the loop to enhance trust and adapt to changing investment criteria. The paper provides actionable guidance for practitioners, highlighting compatibility filtering before prediction, simple yet effective data preprocessing, and model-agnostic explanations, with implications for developing transparent, scalable, and ROI-informed DL systems for startup sourcing.
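The filter-then-predict pipeline and portfolio-based evaluation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Startup` record, the sector-based compatibility rule, the `score` field (standing in for a model's predicted success probability), and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Startup:
    name: str
    sector: str
    score: float     # hypothetical model output: predicted success probability
    succeeded: bool  # hypothetical ground-truth label, known only in a backtest


def build_portfolio(candidates, allowed_sectors, k):
    """Filter by compatibility first, then keep the k highest-scored startups."""
    # Assumption: compatibility is reduced to a simple sector whitelist.
    compatible = [s for s in candidates if s.sector in allowed_sectors]
    ranked = sorted(compatible, key=lambda s: s.score, reverse=True)
    return ranked[:k]


def portfolio_precision(portfolio):
    """Fraction of selected startups that actually succeeded."""
    if not portfolio:
        return 0.0
    return sum(s.succeeded for s in portfolio) / len(portfolio)


# Toy candidates; every value here is made up for illustration.
candidates = [
    Startup("A", "fintech", 0.91, True),
    Startup("B", "biotech", 0.95, True),   # top score, but outside the mandate
    Startup("C", "fintech", 0.40, False),
    Startup("D", "saas", 0.75, False),
    Startup("E", "saas", 0.60, True),
]

portfolio = build_portfolio(candidates, {"fintech", "saas"}, k=3)
precision = portfolio_precision(portfolio)
```

Keeping the compatibility filter separate from the predictor means the investment mandate can change (here, the set of allowed sectors) without retraining the model, and precision over the selected portfolio measures only the picks an investor would actually act on.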

Abstract

Startups often represent newly established business models associated with disruptive innovation and high scalability. They are commonly regarded as powerful engines for economic and social development. Meanwhile, startups are heavily constrained by many factors such as limited financial funding and human resources. Therefore, the chance for a startup to eventually succeed is as rare as "spotting a unicorn in the wild". Venture Capital (VC) strives to identify and invest in unicorn startups during their early stages, hoping to gain a high return. To avoid entirely relying on human domain expertise and intuition, investors usually employ data-driven approaches to forecast the success probability of startups. Over the past two decades, the industry has gone through a paradigm shift, moving from conventional statistical approaches towards machine-learning (ML) based ones. Notably, the rapid growth of data volume and variety is quickly ushering in deep learning (DL), a subset of ML, as a potentially superior approach in terms of capacity and expressivity. In this work, we carry out a literature review and synthesis on DL-based approaches, covering the entire DL life cycle. The objective is a) to obtain a thorough and in-depth understanding of the methodologies for startup evaluation using DL, and b) to distil valuable and actionable learning for practitioners. To the best of our knowledge, our work is the first of this kind.


Paper Structure (17 sections, 17 figures)

This paper contains 17 sections, 17 figures.

Introduction
Avoid Predicting Success and Compatibility Simultaneously
Clearly Define the Success Criteria of Startups
Use Multi-modal, Unstructured, Free and Extrinsic Data
A detailed walk-through of each data category
Several noticeable trends in data selection
Address the Problems of Data Imbalance and Sparsity
Balance the dataset with augmentation or PU-learning
Densify sparse input with simple imputation techniques
Split the Dataset with an Investor-Centric View
Company-centric vs. investor-centric
Understand the data generation process
Model Selection: Occam's Razor and No-Free-Lunch
Evaluate Model with Precision-First and Simulation Mindset
Resort to Model-Agnostic and Instance-Level Explainability
...and 2 more sections

Figures (17)

Figure 1: High-level overview of ML (machine learning) based startup sourcing. The ML model is trained to approximate a function $f(\cdot)$ so that the input data $\mathbf{x}$ describing a startup can be mapped to an output variable $y$ indicating the recommended investment propensity, which can be either discrete (good vs. bad) or continuous (success probability).
Figure 2: DL (deep learning) utilizes ANNs (artificial neural networks) with at least two hidden layers; thus (a) is not considered a DL model in this work. The input data $\mathbf{x}$ is fed into the input layer before flowing through the hidden layers. The output layer generates the final prediction $y$. The connections (fully or partly connected) between adjacent layers carry trainable weights.
Figure 3: Two ways of addressing the probability of startup success and compatibility. Most DL-based work does not explicitly consider startup success and compatibility at the same time. Two feasible solutions are presented here. We recommend solution (b) over (a) due to its simplicity, flexibility, and closer approximation to real use cases.
Figure 4: Summary of the adopted criteria to evaluate startup success. (a) shows the percentage of each success criterion, sorted by occurrence. (b) shows the percentage of work combining different numbers of criteria.
Figure 5: Summary of the categories of input data used by the surveyed work. (a) shows the percentage of each data category (detailed in the section "A detailed walk-through of each data category"), sorted by occurrence. (b) shows a snapshot (as of the time of writing) of the utilized data modalities: numerical, categorical, text, graph, time-series, image, video and audio.
...and 12 more figures
