Large Language Models for Predictive Analysis: How Far Are They?

Qin Chen, Yuanyi Ren, Xiaojun Ma, Yuyang Shi

TL;DR

This work addresses the lack of standardized benchmarks for evaluating large language models (LLMs) in predictive analysis by introducing the PredictiQ benchmark, which aggregates 1,130 data-specific queries from 44 real-world tabular datasets across eight domains. It formalizes the task as producing both a textual justification and executable code from a given dataset and query, and it evaluates 12 prominent LLMs using a three-domain, seven-aspect scoring protocol with GPT-4 Turbo as the primary, human-aligned evaluator. Key findings show that fine-tuning on code can boost predictive performance beyond what parameter count alone would suggest, that text and code generation are interdependent, and that model strength varies markedly across fields and contexts. The study highlights substantial room for improvement in predictive analysis, particularly in data preprocessing, depth of explanation, and efficiency, and it provides a framework to guide future LLM development and evaluation in data-driven decision support.

Abstract

Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the PredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See GitHub: https://github.com/Cqkkkkkk/PredictiQ.

Paper Structure

This paper contains 32 sections, 1 equation, 8 figures, 16 tables.

Figures (8)

  • Figure 1: An example of users conducting predictive analysis via Large Language Models.
  • Figure 2: Score distributions of LLMs on eight fields. For clarity we present the total scores of text, code, and their alignment.
  • Figure 3: Analysis on impact of context length limit.
  • Figure 4: Alignment scores of different evaluators with human experts. See the full human-evaluation figure (label fig:human-eval-full) for complete results.
  • Figure 5: Expert evaluation against evaluation from LLMs.
  • ...and 3 more figures