LiveTradeBench: Seeking Real-World Alpha with Large Language Models

Haofei Yu, Fenghai Li, Jiaxuan You

TL;DR

LiveTradeBench introduces a live, multi-market environment for evaluating LLM-based agents in portfolio management under real-time market dynamics. By streaming live prices and news and unifying trading decisions into a portfolio-allocation action on the simplex, the framework tests sequential decision making across two markets: U.S. stocks and Polymarket prediction markets. Evaluations of 21 LLMs over 50 trading days reveal that high general reasoning performance does not guarantee trading success, that models exhibit distinct risk appetites and allocation styles, and that some models can effectively leverage live signals to adapt decisions. These findings highlight a gap between static benchmark proficiency and real-world financial competence, and they establish LiveTradeBench as a foundation for developing more adaptive, financially grounded, and socially intelligent LLM-based trading agents.

Abstract

Large language models (LLMs) achieve strong performance across benchmarks, from knowledge quizzes and math reasoning to web-agent tasks, but these tests occur in static settings that lack real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments (U.S. stocks and Polymarket prediction markets) that differ in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across model families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.
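To make the portfolio-allocation action concrete, the following is a minimal Python sketch of one environment step: raw percentage allocations are mapped onto the simplex, and the portfolio value is updated from the price change. The function names, the `CASH` key, the clip-and-rescale normalization, and the frictionless rebalancing (no fees, no slippage) are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import Dict

CASH = "CASH"  # hypothetical key for the uninvested share of the portfolio


def normalize_allocation(raw: Dict[str, float]) -> Dict[str, float]:
    """Map raw percentage allocations to weights on the simplex.

    Assumption: negative entries are clipped to zero (no short selling)
    and the remainder is rescaled so the weights sum to 1.
    """
    weights = {asset: max(float(w), 0.0) for asset, w in raw.items()}
    weights.setdefault(CASH, 0.0)
    total = sum(weights.values())
    if total <= 0.0:
        # Degenerate output: fall back to holding only cash.
        return {asset: (1.0 if asset == CASH else 0.0) for asset in weights}
    return {asset: w / total for asset, w in weights.items()}


def step_value(
    value: float,
    weights: Dict[str, float],
    prev_prices: Dict[str, float],
    new_prices: Dict[str, float],
) -> float:
    """Portfolio value after one step.

    Assumption: the portfolio is rebalanced to `weights` at `prev_prices`
    with no transaction costs, and cash earns zero return.
    """
    growth = 0.0
    for asset, w in weights.items():
        if asset == CASH:
            growth += w
        else:
            growth += w * new_prices[asset] / prev_prices[asset]
    return value * growth


# Usage with made-up numbers (not data from the paper).
w = normalize_allocation({"AAPL": 60, "MSFT": 20, "CASH": 20})
v = step_value(100.0, w, {"AAPL": 100.0, "MSFT": 200.0},
               {"AAPL": 110.0, "MSFT": 200.0})
```

Under these assumptions the same update applies to a prediction-market contract, where the price is read as the probability of the outcome.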

Paper Structure

This paper contains 74 sections, 6 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Market selection in LiveTradeBench. The top panels show AAPL in the U.S. stock market (left) and the contract “OpenAI has the best AI model by the end of 2025” in the Polymarket prediction market (right). In prediction markets, the price directly reflects the probability of a given outcome. Both markets respond to news and historical price trends, but Polymarket exhibits sharper fluctuations, faster reactions, and higher sensitivity to external signals. The bottom panels display representative assets across various domains, including technology, finance, cryptocurrency, manufacturing, and politics.
  • Figure 2: Observation and action space for LiveTradeBench. We illustrate examples from both the U.S. stock market and the Polymarket prediction market to demonstrate the observation and action spaces. The observation space consists of three components: the agent’s position, market prices, and relevant news context. The action space represents the portfolio allocation decisions generated by the agent, which can be directly translated into executable trading actions.
  • Figure 3: Agent and environment framework in LiveTradeBench. The left side illustrates the simulated environment, which continuously retrieves real-world market prices and news, updating its internal state accordingly. It also adjusts the agent’s portfolio position based on the executed actions. The right side depicts the portfolio-management agent, equipped with analytical tools to process observations from the environment. The agent maintains a memory of past observations, enabling adaptive and context-aware decision-making.
  • Figure 4: Correlation between LMArena score and Sharpe ratio across two markets. (left) U.S. stock market. (right) Polymarket prediction market. Models from different families are shown in different colors, and the dashed line indicates the linear regression fit.
  • Figure 5: Rolling $k$-delta analysis on U.S. stocks. We evaluate rebalance intervals $k \in \{1, 2, 4, 8, 16\}$. The black line denotes the mean performance across 21 models, and the shaded gray region indicates the 25–75% confidence interval.
  • ...and 10 more figures
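
The comparison described for Figure 4 (Sharpe ratio against LMArena score, with a linear regression fit) can be sketched as below. This is a hedged illustration only: the annualization factor of 252, the zero risk-free rate, and the helper names are assumptions, and the paper may define its Sharpe ratio differently. The inputs in the usage lines are synthetic.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def sharpe_ratio(
    values: Sequence[float],
    risk_free_daily: float = 0.0,
    periods_per_year: int = 252,
) -> float:
    """Annualized Sharpe ratio from a series of daily portfolio values.

    Assumptions: simple daily returns, a constant daily risk-free rate,
    and sqrt(periods_per_year) annualization. Needs at least 3 values.
    """
    returns: List[float] = [b / a - 1.0 for a, b in zip(values, values[1:])]
    excess = [r - risk_free_daily for r in returns]
    sd = stdev(excess)
    if sd == 0.0:
        return float("nan")  # undefined when returns never vary
    return mean(excess) / sd * math.sqrt(periods_per_year)


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Usage with synthetic numbers (not results from the paper).
sr = sharpe_ratio([100.0, 101.0, 103.0, 102.0])
slope, intercept = linear_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In such a fit, each model would contribute one point (its LMArena score and its Sharpe ratio in a given market), and a slope near zero would be consistent with the reported finding that high LMArena scores do not imply superior trading outcomes.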