Enhancing Startup Success Predictions in Venture Capital: A GraphRAG Augmented Multivariate Time Series Method

Zitian Gao, Yihao Xiao

TL;DR

The paper tackles startup success prediction in venture capital under data scarcity by integrating inter-company relations through GraphRAG into a multivariate time-series framework. It introduces a two-stage methodology where GraphRAG extracts a knowledge graph from unstructured text and converts it into a mask matrix via the Leiden algorithm to regularize a Seq2Seq LSTM model, enabling long-horizon predictions. Using a sequence-to-sequence formulation with input $X_n$ (5–10 years of multivariate features) and output $Y_n$ representing the post-IPO $P/B$ ratio across quarters, the approach captures dynamic, multi-dimensional signals beyond binary outcomes. Experiments on Chinese datasets show significant improvements over baselines, including about a 16% gain in $R^2$, with ablation studies confirming the value of relational data and the graph-based regularizer for robust performance in sparse data conditions.
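
The graph-to-mask step described above can be illustrated with a small sketch. This summary does not state the exact form of the mask matrix or of the regularizer, so everything below is an assumption made for illustration: the mask is taken to be 1 for pairs of companies that share a community, the community labels are hypothetical stand-ins for Leiden output, and the penalty (a mean squared gap between predicted sequences of same-community companies) is one plausible choice rather than the paper's actual loss term. The names `community_mask` and `relational_penalty` are likewise hypothetical.

```python
import numpy as np


def community_mask(labels):
    """Build an N x N 0/1 mask: 1 where two distinct companies share a community.

    `labels` is assumed to hold one community id per company, e.g. as
    produced by a community-detection step such as Leiden.
    """
    labels = np.asarray(labels)
    mask = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(mask, 0.0)  # self-pairs carry no relational information
    return mask


def relational_penalty(preds, mask):
    """Illustrative regularizer (an assumption, not the paper's formula).

    preds: array of shape (N, T), one predicted sequence per company.
    Returns the mean squared gap between the predicted sequences of
    every masked (same-community) pair.
    """
    diffs = preds[:, None, :] - preds[None, :, :]  # shape (N, N, T)
    pair_gap = (diffs ** 2).mean(axis=-1)          # shape (N, N)
    n_pairs = mask.sum()
    if n_pairs == 0:
        return 0.0
    return float((mask * pair_gap).sum() / n_pairs)


# Hypothetical example: five companies in two communities,
# each with a two-quarter predicted P/B sequence.
labels = [0, 0, 1, 1, 1]
mask = community_mask(labels)
preds = np.array([
    [1.0, 1.2],
    [1.1, 1.3],
    [3.0, 3.1],
    [3.0, 3.1],
    [2.9, 3.0],
])
penalty = relational_penalty(preds, mask)
```

In a training loop, a term like `penalty` would be added to the prediction loss with a weighting coefficient, so that sparse per-company data is supplemented by information from related companies. How strongly to weight it is a tuning choice this sketch leaves open.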

Abstract

In the Venture Capital (VC) industry, predicting the success of startups is challenging due to limited financial data and the need for subjective revenue forecasts. Previous methods based on time series analysis often fall short because they fail to incorporate crucial inter-company relationships such as competition and collaboration. To fill this gap, this paper introduces a novel approach using a GraphRAG-augmented time series model. With GraphRAG, time series predictive methods are enhanced by integrating these relationships into the analysis framework, allowing for a more dynamic understanding of the startup ecosystem in venture capital. Our experimental results demonstrate that our model significantly outperforms previous models in startup success prediction.

Paper Structure

This paper contains 16 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the proposed framework. The upper part shows the extraction of a knowledge graph from unstructured text using GraphRAG, followed by its transformation into a mask matrix using the Leiden algorithm. The lower part shows the multivariate sequence-to-sequence LSTM model [Mou].
  • Figure 2: The knowledge graph generated using GraphRAG and the Leiden algorithm, laid out with OpenORD [openord]. Different colors represent different industries (community clusters).