When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou

TL;DR

This work presents AMA, the first lifelong, real-time, multi-asset benchmark for evaluating LLM-based trading agents in live markets using verified data streams. The framework combines Market Intelligence Stream, Agent Execution Protocol, and Performance Analytics Interface to enable fair, continuous comparisons across diverse agent architectures and LLM backbones. Key findings show that agent architecture largely governs performance and risk behavior, while backbone changes have comparatively smaller effects, with memory-based reasoning and varied trading styles shaping adaptability and profitability. The benchmark offers a transparent, evolving platform for rigorous study of financial reasoning and trading intelligence in LLM-driven systems, with practical implications for designing robust autonomous trading agents across assets.

Abstract

Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents: InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning. It evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

Paper Structure

This paper contains 21 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overall framework of Agent Market Arena.
  • Figure 2: Aggregated performance of different agents and LLMs across four assets. The solid line represents the average cumulative return, while the shaded area indicates the range between the maximum and minimum CR observed when switching LLMs or agent frameworks.
  • Figure 3: Agent performance on BTC under different market events. The bars represent the profit gap between TradeAgent (Gemini-2.0-flash) and InvestorAgent (GPT-4.1).
  • Figure 4: Cumulative return comparison of Buy-and-Hold baseline for BTC (Aug–Sep 2025), annotated with daily news sentiment (green = bullish, gray = neutral, red = bearish) and agent voting signals (squares: blue = BUY, gray = HOLD, orange = SELL).
  • Figure 5: Agent Market Arena Dashboard (Overview View). The interface provides a unified view for monitoring, comparing, and analyzing the real-time performance of LLM-based trading agents across assets, models, and strategies.
  • ...and 1 more figure
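
The Figure 2 caption describes an aggregation: an average cumulative return (CR) line with a band spanning the minimum and maximum CR observed when the LLM or agent framework is switched. The sketch below shows one way such a band could be computed. It is illustrative only: the compounding definition of CR, the function names, and the sample daily returns are assumptions made for this example, not the paper's method or results.

```python
from statistics import mean


def cumulative_return(daily_returns):
    """Compound simple daily returns into a cumulative-return series.

    Assumption: CR on day t is prod(1 + r_i for i <= t) - 1.
    """
    series, growth = [], 1.0
    for r in daily_returns:
        growth *= 1.0 + r
        series.append(growth - 1.0)
    return series


def aggregate_band(runs):
    """Return per-day (mean, min, max) CR across runs.

    `runs` maps a hypothetical run label (e.g. one backbone) to its
    list of daily returns; all runs must cover the same number of days.
    """
    curves = [cumulative_return(r) for r in runs.values()]
    if len({len(c) for c in curves}) != 1:
        raise ValueError("all runs must have the same number of days")
    days = list(zip(*curves))  # regroup values day by day
    avg = [mean(d) for d in days]
    low = [min(d) for d in days]
    high = [max(d) for d in days]
    return avg, low, high


if __name__ == "__main__":
    # Made-up daily returns for three placeholder backbones.
    sample_runs = {
        "backbone_a": [0.01, -0.02, 0.03],
        "backbone_b": [0.00, 0.01, -0.01],
        "backbone_c": [0.02, 0.00, 0.01],
    }
    avg, low, high = aggregate_band(sample_runs)
    for day, (a, lo, hi) in enumerate(zip(avg, low, high), start=1):
        print(f"day {day}: mean={a:.4f} min={lo:.4f} max={hi:.4f}")
```

Plotting `avg` as a solid line and filling between `low` and `high` would give a chart of the kind the caption describes; a narrow band means that switching the varied component changes outcomes little.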