StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?

Yanxu Chen; Zijun Yao; Yantao Liu; Jin Ye; Jianing Yu; Lei Hou; Juanzi Li

TL;DR

StockBench introduces a contamination-free, multi-month stock-trading benchmark that places LLM-backed agents in a realistic back-trading environment with daily market signals. The framework emphasizes realism, continuous decision-making, and fair data separation, and evaluates agents with financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Across diverse LLM backbones, most agents do not consistently beat a simple buy-and-hold baseline, though several achieve higher returns and better risk management, underscoring the gap between static financial reasoning and dynamic market decision-making. StockBench is released as open source to support reproducibility and future research on AI-driven financial agents.

Abstract

Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
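The three metrics named in the abstract can be illustrated with a minimal sketch. This is not the benchmark's own implementation: the function names, the per-period (non-annualized) Sortino form, and the zero target return are assumptions made for illustration, since the abstract does not give the exact definitions.

```python
import math


def cumulative_return(values):
    """Total return over the whole period, e.g. 0.2 means +20%."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values):
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                 # highest portfolio value seen so far
        worst = min(worst, v / peak - 1.0)  # decline relative to that peak
    return worst


def sortino_ratio(values, target=0.0):
    """Mean excess return divided by downside deviation.

    Assumption: per-period returns, no annualization, target return of 0.
    """
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    mean_excess = sum(r - target for r in returns) / len(returns)
    # Only returns below the target contribute to downside deviation.
    downside = math.sqrt(
        sum(min(r - target, 0.0) ** 2 for r in returns) / len(returns)
    )
    if downside == 0.0:
        return math.inf if mean_excess > 0 else 0.0
    return mean_excess / downside


# Hypothetical daily portfolio values, for illustration only.
portfolio = [100.0, 110.0, 99.0, 120.0]
print(cumulative_return(portfolio))
print(max_drawdown(portfolio))
print(sortino_ratio(portfolio))
```

Unlike the Sharpe ratio, the Sortino ratio penalizes only downside volatility, so a strategy is not marked down for large upward moves.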
