AI-Trader: Benchmarking Autonomous Agents in Real-Time Financial Markets

Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, Chao Huang

TL;DR

The paper addresses the lack of robust benchmarks for autonomous LLM agents in live financial markets and introduces AI-Trader, a fully autonomous, real-time, data-uncontaminated benchmarking framework spanning U.S. stocks, A-shares, and cryptocurrencies, with hourly and daily trading granularities. It evaluates six mainstream LLM backbones using a minimal-information paradigm and a standardized MCP-based toolchain to enable end-to-end live trading, information retrieval, and decision execution. Key findings show that general intelligence does not automatically yield profitable trading; risk-control capabilities drive cross-market robustness, and liquidity affects excess returns, with notable cross-market generalization limitations and crypto-specific dynamics revealed through case analyses. These results underscore the need for improved risk-aware, adaptive strategies in autonomous trading agents and provide a foundation for future, more robust live-finance benchmarks and methodologies.

Abstract

Large Language Models (LLMs) have demonstrated remarkable potential as autonomous agents, approaching human-expert performance through advanced reasoning and tool orchestration. However, decision-making in fully dynamic and live environments remains highly challenging, requiring real-time information integration and adaptive responses. While existing efforts have explored live evaluation mechanisms in structured tasks, a critical gap remains in systematic benchmarking for real-world applications, particularly in finance where stringent requirements exist for live strategic responsiveness. To address this gap, we introduce AI-Trader, the first fully-automated, live, and data-uncontaminated evaluation benchmark for LLM agents in financial decision-making. AI-Trader spans three major financial markets: U.S. stocks, A-shares, and cryptocurrencies, with multiple trading granularities to simulate live financial environments. Our benchmark implements a revolutionary fully autonomous, minimal-information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market information without human intervention. We evaluate six mainstream LLMs across three markets and multiple trading frequencies. Our analysis reveals striking findings: general intelligence does not automatically translate to effective trading capability, with most agents exhibiting poor returns and weak risk management. We demonstrate that risk control capability determines cross-market robustness, and that AI trading strategies achieve excess returns more readily in highly liquid markets than in policy-driven environments. These findings expose critical limitations in current autonomous agents and provide clear directions for future improvements. The code and evaluation data are open-sourced to foster community research: https://github.com/HKUDS/AI-Trader.

Paper Structure

This paper contains 18 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The architecture of AI-Trader. In AI-Trader, all information must be acquired through tools, ensuring that the decisions and actions generated by the agent can be observed under fully autonomous behavior. We equip the agent with three mainstream trading environments: U.S. stocks, A-share stocks, and cryptocurrencies. Additionally, we provide five fundamental tools that enable the agent not only to interact with these trading environments but also to perform computations via Bash, read local files to access stock data, and browse the web for general information.
  • Figure 2: Changes in each agent's holdings over time. Top: U.S. stocks; bottom: A-shares.
  • Figure 3: Cumulative Return (CR) trajectories for all agents across three markets: U.S., A-Share, and Crypto. MiniMax-M2 and DeepSeek-v3.1 demonstrate superior performance in the U.S. market. Notably, MiniMax-M2 is the only consistently profitable agent in the A-Share market, while DeepSeek-v3.1 is the sole agent outperforming the baseline in the Crypto market.
  • Figure 4: The agent exhibits investment behaviors similar to those of humans. Left: Avoids a major market crash by gathering news and applying sound reasoning. Right: Makes emotionally driven investment decisions triggered by misleading news.
  • Figure A1: Trading performance metrics over time for all agents in both markets. Each subfigure contains four panels showing the evolution of: Cumulative Return (CR), Sortino Ratio (SR), Volatility (Vol), and Maximum Drawdown (MDD). MiniMax-M2 and DeepSeek-v3.1 demonstrate superior performance in the U.S. market, while MiniMax-M2 is the only consistently profitable agent in the A-Share market.
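The four metrics named in the Figure A1 caption can be illustrated with a minimal Python sketch. It assumes a plain list of portfolio values sampled at a fixed interval, a target return of zero for the Sortino Ratio, and no annualization; the paper's exact definitions are not given in this summary, so the function names and conventions below are hypothetical illustrations rather than the benchmark's implementation.

```python
import math

def period_returns(values):
    """Simple per-period returns from a series of portfolio values."""
    return [b / a - 1.0 for a, b in zip(values, values[1:])]

def cumulative_return(values):
    """CR: total growth from the first to the last portfolio value."""
    return values[-1] / values[0] - 1.0

def volatility(returns):
    """Vol: sample standard deviation of per-period returns (not annualized)."""
    n = len(returns)
    mean = sum(returns) / n
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / (n - 1))

def sortino_ratio(returns, target=0.0):
    """SR: mean excess return divided by downside deviation.

    The target of 0.0 is an assumption; only returns below the target
    contribute to the denominator.
    """
    n = len(returns)
    mean_excess = sum(r - target for r in returns) / n
    downside = math.sqrt(sum(min(r - target, 0.0) ** 2 for r in returns) / n)
    # With no below-target returns the ratio is undefined; report infinity here.
    return math.inf if downside == 0 else mean_excess / downside

def max_drawdown(values):
    """MDD: largest peak-to-trough decline, as a fraction of the peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

# Hypothetical portfolio values, not data from the paper.
values = [100.0, 110.0, 99.0, 108.9]
returns = period_returns(values)
print(cumulative_return(values), volatility(returns),
      sortino_ratio(returns), max_drawdown(values))
```

Because the Sortino Ratio penalizes only below-target returns, an agent with large upside swings is not marked down for them, whereas Volatility treats gains and losses symmetrically.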