Beyond Classification: Financial Reasoning in State-of-the-Art Language Models

Guijin Son, Hanearl Jung, Moonjeong Hahm, Keonju Na, Sol Jin

TL;DR

This paper investigates financial reasoning capabilities in large language models by introducing the Financial Investment Opinion Generation (FIOG) task and the In-Context Question Answering (ICQA) prompting method. It provides the sFIOG dataset of 11,802 synthetic investment-thesis samples and benchmarks GPT variants ranging from 2.8B to 13B parameters, with and without instruction tuning. The results show that coherent financial reasoning first emerges at around 6B parameters and improves with larger datasets and better instruction tuning, and that ICQA yields more controlled outputs. However, LLM evaluators do not align well with human judgments of financial reasoning. The work offers a public benchmark and dataset for finance-focused reasoning in language models and highlights directions for scaling, prompting, and evaluation in financial contexts.

Abstract

Large Language Models (LLMs), consisting of 100 billion or more parameters, have demonstrated remarkable ability in complex multi-step reasoning tasks. However, the application of such generic advancements has been limited to a few fields, such as clinical or legal, with the field of financial reasoning remaining largely unexplored. To the best of our knowledge, the ability of LLMs to solve financial reasoning problems has never been dealt with, and whether it can be performed at any scale remains unknown. To address this knowledge gap, this research presents a comprehensive investigation into the potential application of LLMs in the financial domain. The investigation includes a detailed exploration of a range of subjects, including task formulation, synthetic data generation, prompting methods, and evaluation capability. Furthermore, the study benchmarks various GPT variants with parameter scales ranging from 2.8B to 13B, with and without instruction tuning, on diverse dataset sizes. By analyzing the results, we reveal that the ability to generate coherent financial reasoning first emerges at 6B parameters, and continues to improve with better instruction-tuning or larger datasets. Additionally, the study provides a publicly accessible dataset named sFIOG (Synthetic-Financial Investment Opinion Generation), consisting of 11,802 synthetic investment thesis samples, to support further research in the field of financial reasoning. Overall, this research seeks to contribute to the understanding of the efficacy of language models in the field of finance, with a particular emphasis on their ability to engage in sophisticated reasoning and analysis within the context of investment decision-making.

Paper Structure

This paper contains 20 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Qualitative Evaluation of Collected Investment Theses: Green denotes expert-written, blue represents full-text type, and dark blue indicates Q&A type. G1 and G2 refer to GPT-4 answers for Question 1 and Question 2, respectively. H1 and H2 denote human answers for Question 1 and Question 2, respectively.
  • Figure 2: Left for Q1, Right for Q2.
  • Figure 3: Performance of Vicuna across varying training steps. The x-axis denotes the training step, presented in the format sample_size(epoch). The y-axis displays the corresponding ROUGE-L scores.
  • Figure 4: Human preference on generated samples. Dark blue for LLaMA, green for Galactica, blue for GPT-J, and yellow for Pythia (2.8B).
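
Figure 3 reports ROUGE-L scores. As a rough illustration of what that metric measures, the sketch below computes a ROUGE-L F1 score from the longest common subsequence (LCS) of two token sequences. It assumes lowercased whitespace tokenization and an equally weighted F-measure; the function names are hypothetical, and the tokenization, stemming, and weighting actually used in the paper are not stated in this summary and may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence that ends before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping either token.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed: whitespace tokens, beta = 1)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens found in the LCS
    recall = lcs / len(ref)      # share of reference tokens found in the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on in-order token overlap, it rewards a generated investment thesis for reproducing the reference's wording and ordering, not for the soundness of its reasoning.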