Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks

Xianzhi Li, Samuel Chan, Xiaodan Zhu, Yulong Pei, Zhiqiang Ma, Xiaomo Liu, Sameena Shah

TL;DR

This study benchmarks ChatGPT and GPT-4 on eight financial text analytics datasets, comparing them to fine-tuned finance models and BloombergGPT. The results show that GPT-4 often outperforms ChatGPT and rivals domain-specific models on many tasks, especially question answering (QA) and reasoning, but lags on structured prediction tasks such as named entity recognition (NER) and relation extraction (RE), where domain-specific fine-tuning remains superior. Prompting strategies, particularly Chain-of-Thought, significantly boost performance, yet purely generalist models still struggle with high-stakes, domain-heavy analyses. The work highlights both the promise and the limitations of generalist LLMs in finance, suggesting a pragmatic role as decision-support tools alongside specialized models and human expertise.

Abstract

The most recent large language models (LLMs) such as ChatGPT and GPT-4 have shown exceptional capabilities as generalist models, achieving state-of-the-art performance on a wide range of NLP tasks with little or no adaptation. How effective are such models in the financial domain? Understanding this basic question would have a significant impact on many downstream financial analytical tasks. In this paper, we conduct an empirical study and provide experimental evidence of their performance on a wide variety of financial text analytical problems, using eight benchmark datasets from five categories of tasks. We report both the strengths and limitations of the current models by comparing them to the state-of-the-art fine-tuned approaches and the recently released domain-specific pretrained models. We hope our study can help readers understand the capability of the existing models in the financial domain and facilitate further improvements.

Paper Structure (36 sections, 12 figures, 8 tables)


Figures (12)

  • Figure 1: FinQA program steps analysis
  • Figure 2: Headlines few shot results curve
  • Figure 3: FiQA few shot results curve
  • Figure 4: PFB few shot results curve
  • Figure 5: TweetFinSent few shot results curve
  • ...and 7 more figures