Benchmarking LLM Agents for Wealth-Management Workflows
Rory Milsom
TL;DR
This work extends a workplace-benchmark framework (TheAgentCompany) with a finance-focused environment to evaluate general-purpose LLM agents on wealth-management assistant tasks. It introduces a reproducible, tool-rich stack (EspoCRM, OwnCloud, Rocket.Chat, Plane) and a 24-task wealth-management benchmark (12 task pairs) with deterministic evaluators and high- and low-autonomy variants of each task. The study finds that end-to-end workflow reliability and task autonomy substantially shape agent performance: adjusting the autonomy level reduces drift while search effort stays similar, and better evaluators reduce false positives and clarify failure modes. Results show meaningful gains over prior finance benchmarks and highlight persistent bottlenecks in authentication, cross-tool delivery, and delivery fidelity, which can inform future benchmark design and cost-aware evaluation in finance settings.
Abstract
Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. The study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. Its aim is to create and assess an evaluation set that can meaningfully measure an agent's fitness for assistant-level wealth-management work. We construct a benchmark of 12 task pairs for wealth-management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seed the environment with new finance-specific data and provide a high-autonomy and a low-autonomy variant of every task. We conclude that agents are limited less by mathematical reasoning than by end-to-end workflow reliability, that their performance is meaningfully affected by autonomy level, and that incorrect evaluation of models has hindered benchmarking.
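To make the notion of a task pair with explicit acceptance criteria and a deterministic grader concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual implementation: the names `Checkpoint`, `TaskPair`, `grade`, and `score`, the example prompts, the state keys, and the file path are all hypothetical. The idea shown is that the high- and low-autonomy variants differ only in their prompt, while both are graded by the same fixed checks on the final environment state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the final environment state as flat key/value pairs.
State = Dict[str, str]


@dataclass
class Checkpoint:
    """One acceptance criterion: a name plus a pure predicate on the final state."""
    name: str
    check: Callable[[State], bool]


@dataclass
class TaskPair:
    """Two prompts (high/low autonomy) that share one set of acceptance criteria."""
    task_id: str
    high_autonomy_prompt: str
    low_autonomy_prompt: str
    checkpoints: List[Checkpoint]

    def grade(self, final_state: State) -> Dict[str, bool]:
        # Deterministic: the same state always yields the same per-checkpoint results.
        return {cp.name: cp.check(final_state) for cp in self.checkpoints}


def score(results: Dict[str, bool]) -> float:
    """Fraction of acceptance criteria that passed."""
    return sum(results.values()) / len(results)


# Hypothetical example task; all values are invented for illustration only.
pair = TaskPair(
    task_id="portfolio-summary",
    high_autonomy_prompt="Send the client's portfolio total to the advisor.",
    low_autonomy_prompt=(
        "Open the holdings file, sum the market values, save the summary to "
        "/Documents/summary.csv, and post the total in the advisor channel."
    ),
    checkpoints=[
        Checkpoint(
            "total_reported",
            lambda s: "125000.00" in s.get("chat_message", ""),
        ),
        Checkpoint(
            "summary_saved",
            lambda s: s.get("report_path", "") == "/Documents/summary.csv",
        ),
    ],
)

if __name__ == "__main__":
    state = {"chat_message": "Total portfolio value: 125000.00 GBP"}
    results = pair.grade(state)
    print(results, score(results))
```

Because each check is an exact predicate rather than a model-based judgement, an approximate answer such as "about 125k" fails the first criterion instead of being accepted, which is one way a grader of this kind can reduce false positives.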
