Benchmarking LLM Agents for Wealth-Management Workflows

Rory Milsom

TL;DR

This work extends a workplace-benchmark framework (TheAgentCompany) with a finance-focused environment to evaluate general-purpose LLM agents performing wealth-management assistant tasks. It introduces a reproducible, tool-rich stack (EspoCRM, OwnCloud, Rocket.Chat, Plane) and a 24-task wealth-management benchmark with deterministic evaluators and high/low autonomy variants. The study finds that end-to-end workflow reliability and task autonomy substantially shape agent performance, with autonomy tweaks reducing drift while maintaining similar search effort; better evaluators reduce false positives and clarify failure modes. Results show meaningful gains over prior finance benchmarks and highlight persistent bottlenecks in authentication, cross-tool delivery, and delivery fidelity, informing future benchmark design and cost-aware evaluation in finance settings.

Abstract

Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. The study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. Its aim is to create and assess an evaluation set that can meaningfully measure an agent's fitness for assistant-level wealth-management work. We construct a benchmark of 12 task-pairs for wealth-management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seed new finance-specific data and introduce a high-autonomy and a low-autonomy variant of every task. We conclude that agents are limited less by mathematical reasoning than by end-to-end workflow reliability, that they are meaningfully affected by autonomy level, and that incorrect evaluation of models has hindered benchmarking.

Paper Structure

This paper contains 41 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of TAC Architecture [xu2024theagentcompany]
  • Figure 2: Environment architecture overview. The OpenHands Agent Runtime orchestrates OwnCloud, EspoCRM, Rocket.Chat, and Plane via APIs; Sotopia injects agent capabilities into Rocket.Chat; outputs are validated by Evaluators.
  • Figure 3: Experiment 1: % Checkpoints passed. Left: Original TAC tasks (12). Right: New tasks (high autonomy, 12).
  • Figure 4: Experiment 1: Cost distribution. Left: Original TAC tasks (12). Right: New tasks (high autonomy, 12).
  • Figure 5: Experiment 1: Step-counts. Left: Original TAC tasks (12). Right: New tasks (high autonomy, 12).
  • ...and 3 more figures