SSFF: Investigating LLM Predictive Capabilities for Startup Success through a Multi-Agent Framework with Enhanced Explainability and Performance
Xisen Wang, Yigit Ihlamur, Fuat Alican
TL;DR
The paper tackles the challenge of predicting startup success with large language models (LLMs) and reveals an over-prediction bias when LLMs rely on founder claims in data-scarce settings. It proposes the Startup Success Forecasting Framework (SSFF), a hybrid multi-agent system that combines traditional machine learning (e.g., random forests, neural networks) with LLM reasoning and retrieval-augmented knowledge to deliver data-driven startup evaluations. The framework comprises three blocks (prediction, analysis, and external knowledge) and reports significant performance gains over baseline LLMs: a 108.3% relative improvement over GPT-4o mini and a 30.8% relative improvement over GPT-4o. Its founder segmentation is interpretable and shows that startups led by top-tier (L5) founders are 3.79 times more likely to succeed than those led by L1 founders. Qualitative evaluations corroborate the improved transparency and decision support. Limitations include a small sample and remaining biases, which motivate future work on broader model comparisons, larger datasets, and human-in-the-loop validation.
Abstract
LLM-based agents have recently demonstrated strong potential in automating complex tasks, yet accurately predicting startup success remains an open challenge with few benchmarks and few tailored frameworks. To address these limitations, we propose the Startup Success Forecasting Framework (SSFF), an autonomous system that emulates the reasoning of venture capital analysts through a multi-agent collaboration model. Our framework integrates traditional machine learning methods, such as random forests and neural networks, within a retrieval-augmented generation architecture composed of three interconnected modules: a prediction block, an analysis block, and an external knowledge block. We evaluate our framework and identify three main findings. First, founder segmentation shows that startups led by L5 founders are 3.79 times more likely to succeed than those led by L1 founders. Second, baseline large language models consistently overpredict startup success and struggle under realistic class imbalances, largely due to overreliance on founder claims. Third, our framework significantly enhances prediction accuracy, yielding a 108.3% relative improvement over GPT-4o mini and a 30.8% relative improvement over GPT-4o. These results demonstrate the value of combining a multi-agent approach with discriminative machine learning to mitigate the limitations of standard LLM-based prediction methods.
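The two kinds of headline figures above (a "times more likely" ratio and a percent relative improvement) can be read with the following minimal sketch. It assumes the conventional definitions, a ratio of success rates and (new - baseline) / baseline, which the abstract does not spell out. The function names and all input numbers are hypothetical and chosen only to keep the arithmetic easy to follow; they are not values from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percent relative improvement of `improved` over `baseline`.

    Assumes the conventional definition (improved - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


def success_rate_ratio(successes_a: int, total_a: int,
                       successes_b: int, total_b: int) -> float:
    """How many times more likely group A is to succeed than group B."""
    if total_a <= 0 or total_b <= 0 or successes_b <= 0:
        raise ValueError("totals and group B successes must be positive")
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    return rate_a / rate_b


# Hypothetical inputs, NOT taken from the paper.
baseline_accuracy = 0.25   # assumed score of a baseline model
framework_accuracy = 0.50  # assumed score of the improved system
print(relative_improvement(baseline_accuracy, framework_accuracy))

# Hypothetical founder groups: 50 of 100 succeed versus 25 of 100.
print(success_rate_ratio(50, 100, 25, 100))
```

Under these definitions, a relative improvement above 100% means the score more than doubled, so the 108.3% figure would correspond to a score slightly more than twice the baseline.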
