Using Deep Learning to Find the Next Unicorn: A Practical Synthesis
Lele Cao, Vilhelm von Ehrenheim, Sebastian Krakowski, Xiaoxue Li, Alexandra Lutz
TL;DR
This work targets the challenge of predicting unicorn startups in early-stage VC by synthesizing deep learning approaches across a nine-task lifecycle, from problem framing to deployment. It advocates a practical pipeline that emphasizes multi-modal and extrinsic data, investor-centric data splits, data balance and imputation strategies, and portfolio-based evaluation to reflect real-world ROI. A strong emphasis is placed on explainability (global and instance-level) and keeping humans in the loop to enhance trust and adapt to changing investment criteria. The paper provides actionable guidance for practitioners, highlighting compatibility filtering before prediction, simple yet effective data preprocessing, and model-agnostic explanations, with implications for developing transparent, scalable, and ROI-informed DL systems for startup sourcing.
Abstract
Startups often represent newly established business models associated with disruptive innovation and high scalability. They are commonly regarded as powerful engines for economic and social development. Meanwhile, startups are heavily constrained by many factors such as limited financial funding and human resources. Therefore, the chance for a startup to eventually succeed is as rare as "spotting a unicorn in the wild". Venture Capital (VC) strives to identify and invest in unicorn startups during their early stages, hoping to gain a high return. To avoid entirely relying on human domain expertise and intuition, investors usually employ data-driven approaches to forecast the success probability of startups. Over the past two decades, the industry has undergone a paradigm shift from conventional statistical approaches towards machine-learning (ML) based ones. Notably, the rapid growth of data volume and variety is quickly ushering in deep learning (DL), a subset of ML, as a potentially superior approach in terms of capacity and expressivity. In this work, we carry out a literature review and synthesis on DL-based approaches, covering the entire DL life cycle. The objective is a) to obtain a thorough and in-depth understanding of the methodologies for startup evaluation using DL, and b) to distil valuable and actionable lessons for practitioners. To the best of our knowledge, our work is the first of its kind.
