Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks

Antoine Bigeard, Langston Nashold, Rayan Krishnan, Shirley Wu

TL;DR

This paper introduces the Finance Agent Benchmark, a comprehensive, expert-curated evaluation framework for LLM-based finance agents that operate on real-world public filings via live data sources like EDGAR. It combines 537 questions across nine task categories with a rubric-based, LLM-as-judge evaluation and a model-agnostic harness that provides Google Search and EDGAR access to test autonomous, tool-augmented reasoning. The results reveal substantial gaps in current AI capabilities: the best model reaches only 46.8% accuracy, while offering faster and more cost-efficient analyses than human experts. The work establishes a rigorous benchmark for tracking progress in finance-focused agents and outlines future directions, including better handling of structured data and broader data access, supported by open-source tooling for reproducibility.

Abstract

Artificial Intelligence (AI) technology has emerged as a transformative force in financial analysis and the finance industry, though significant questions remain about the full capabilities of Large Language Model (LLM) agents in this domain. We present the Finance Agent Benchmark, featuring challenging and diverse real-world finance research problems that require LLMs to perform complex analysis using recent SEC filings. We construct the benchmark using a taxonomy of nine financial task categories, developed in consultation with experts from banks, hedge funds, and private equity firms. The dataset includes 537 expert-authored questions covering tasks from information retrieval to complex financial modeling, each validated through a rigorous review process to ensure accuracy and relevance. Moreover, we implement an agentic harness that equips LLMs with tools sufficient to produce accurate responses, including Google Search and EDGAR database access. Overall, the Finance Agent Benchmark provides a comprehensive testbed for measuring the progress of LLM-driven finance agents. Our evaluation reveals significant limitations in current AI capabilities: even the best-performing model (OpenAI o3) achieved only 46.8% accuracy at an average cost of $3.79 per query. This underscores the need for further advancements before reliable deployment in high-stakes finance settings.

Paper Structure

This paper contains 44 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Cost-accuracy Pareto curve on the Finance Agent Benchmark. The Finance Agent Benchmark reveals a clear logarithmic relationship between accuracy and cost, with sharply diminishing returns beyond $1 per question, highlighting that even today's most sophisticated models struggle to achieve greater than 50% accuracy on real-world financial tasks.
  • Figure 2: Architecture of the Finance Agent Benchmark. The framework features a structured evaluation process with four key steps: (1) Data Creation: experts identify practical and common financial questions requiring access to public financial documents and provide reference answers. (2) Rubric Development: expert-generated data is used to create robust rubrics with expected calculations and reasoning steps for standardized LLM evaluation. (3) Agent Evaluation: questions are processed through LLMs equipped with necessary tools to generate answers. (4) Answer Grading: an LLM-as-judge scoring system applies conjunction rules to determine correctness across multiple criteria.
  • Figure 3: Financial Agent Harness architecture showing the interaction flow between LLM components, specialized tools, and information sources.
  • Figure 4: Overall tool use analysis. The best models not only better understand how to use the tools, they also tend to search deeper and be more persistent before settling on an answer. GPT 4o Mini is an outlier in this behavior, with a high number of unsuccessful tool calls.
  • Figure 5: Agent trajectories for various models. On each subplot, the numbers above the row are the turn index, and the numbers below are the number of tokens used for the turn (in thousands).
  • ...and 12 more figures
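
The Figure 2 caption says that answer grading applies conjunction rules across multiple rubric criteria. A minimal sketch of one way to read that rule is given below, assuming an answer counts as correct only when every criterion is judged satisfied. The names `CriterionResult`, `grade_answer`, and `accuracy` are hypothetical and do not come from the paper or its tooling, and the per-criterion verdicts are assumed to have been produced already by the LLM judge.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One rubric criterion plus the judge's verdict (hypothetical structure)."""
    description: str
    satisfied: bool


def grade_answer(results: list[CriterionResult]) -> bool:
    """Conjunction rule: the answer is correct only if every criterion is satisfied."""
    # Assumption: an empty rubric is treated as an error, not as vacuously correct.
    if not results:
        raise ValueError("rubric must contain at least one criterion")
    return all(r.satisfied for r in results)


def accuracy(graded: list[bool]) -> float:
    """Fraction of questions graded correct; 0.0 for an empty list."""
    return sum(graded) / len(graded) if graded else 0.0


if __name__ == "__main__":
    # Illustrative verdicts only; these are not data from the benchmark.
    full = [CriterionResult("cites the right filing", True),
            CriterionResult("final figure matches reference", True)]
    partial = [CriterionResult("cites the right filing", True),
               CriterionResult("final figure matches reference", False)]
    graded = [grade_answer(full), grade_answer(partial)]
    print(graded, accuracy(graded))
```

Under this reading, a single failed criterion makes the whole answer incorrect, so partially right answers earn no credit.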