INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent

Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, Koduvayur Subbalakshmi, Guojun Xiong, Jimin Huang, Lingfei Qian, Xueqing Peng, Qianqian Xie, Jordan W. Suchow

TL;DR

InvestorBench addresses the lack of standardized benchmarks and adaptable frameworks for evaluating LLM-based financial decision-making across diverse tasks. It introduces an open-source benchmark with multi-source market data, three asset-class tasks, a memory-augmented LLM agent framework, and a unified evaluation protocol tested across 13 backbone models. Stock trading results show proprietary backbones generally outperform open-source and domain-tuned models, and memory-reflection mechanisms improve robustness in open-ended decision contexts. The platform provides a scalable, multi-modal testbed for rigorous comparison of financial reasoning by LLM agents, enabling faster development and deployment of robust decision-making tools.

Abstract

Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities (stocks), cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, multi-modal datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents' performance across various scenarios.

Paper Structure

This paper contains 21 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: General architecture of InvestorBench.
  • Figure 2: (1) The language agent's memory module is crafted to interact with the market environment to conduct various financial decision-making tasks. It contains two core components: Working Memory and Layered Long-term Memory. (2) The outline of the agent's decision-making workflow for retrieving critical memory events and market observations to inform specific investment decisions.
  • Figure 3: Agent Performance Comparisons from two key perspectives. The CR, SR, AV, and MDD represent the average values for each model type, expressed as a percentage relative to the Buy & Hold strategy.
  • Figure 4: First section of FinMem's workflow for perceiving and processing multi-sourced information from the market environment.
  • Figure 5: Second section of FinMem's workflow for generating trading action, reasoning and reflection.
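
The Figure 3 caption reports CR, SR, AV, and MDD as percentages relative to the Buy & Hold strategy. The sketch below shows one way such values could be computed from a series of portfolio values. It assumes the abbreviations stand for cumulative return, Sharpe ratio, annualized volatility, and maximum drawdown, and it uses common textbook definitions; the paper's exact formulas may differ. All function names, the 252-period annualization factor, and the zero risk-free rate are illustrative assumptions, not taken from the source.

```python
import math


def log_returns(values):
    """Per-period log returns of a portfolio value series."""
    return [math.log(b / a) for a, b in zip(values, values[1:])]


def cumulative_return(values):
    """Simple cumulative return over the whole series (assumed meaning of CR)."""
    return values[-1] / values[0] - 1.0


def annualized_volatility(values, periods_per_year=252):
    """Sample standard deviation of log returns, annualized (assumed meaning of AV)."""
    r = log_returns(values)
    mean = sum(r) / len(r)
    var = sum((x - mean) ** 2 for x in r) / (len(r) - 1)
    return math.sqrt(var) * math.sqrt(periods_per_year)


def sharpe_ratio(values, risk_free_rate=0.0, periods_per_year=252):
    """Annualized excess return divided by annualized volatility (assumed meaning of SR)."""
    r = log_returns(values)
    annual_return = (sum(r) / len(r)) * periods_per_year
    return (annual_return - risk_free_rate) / annualized_volatility(values, periods_per_year)


def max_drawdown(values):
    """Largest peak-to-trough decline as a fraction of the peak (assumed meaning of MDD)."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def relative_to_baseline(agent_metric, baseline_metric):
    """Express an agent's metric as a percentage of the Buy & Hold value."""
    return 100.0 * agent_metric / baseline_metric


if __name__ == "__main__":
    # Hypothetical value series, for illustration only.
    agent = [100.0, 104.0, 101.0, 110.0, 115.0]
    buy_and_hold = [100.0, 103.0, 97.0, 105.0, 108.0]
    for name, fn in [("CR", cumulative_return), ("SR", sharpe_ratio),
                     ("AV", annualized_volatility), ("MDD", max_drawdown)]:
        print(name, round(relative_to_baseline(fn(agent), fn(buy_and_hold)), 1))
```

For CR and SR, a relative value above 100 indicates the agent beat Buy & Hold; for AV and MDD, a value below 100 indicates lower risk than the baseline.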