Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimization

Joohyoung Jeon, Hongchul Lee

Abstract

For LLM trading agents to be genuinely trustworthy, they must demonstrate an understanding of market dynamics rather than exploit memorized ticker associations. Building responsible multi-agent systems demands rigorous signal validation: proving that predictions reflect legitimate patterns, not pre-trained recall. We address two sources of spurious performance: memorization bias from ticker-specific pre-training, and survivorship bias from flawed backtesting. Our approach is to blindfold the agents by anonymizing all identifiers, and then to verify whether meaningful signals persist. BlindTrade anonymizes tickers and company names, and four LLM agents output scores along with their reasoning. We construct a GNN graph from the reasoning embeddings and trade using a PPO-DSR policy. On 2025 YTD (through 2025-08-01), we achieve a Sharpe ratio of 1.40 $\pm$ 0.22 across 20 seeds and validate signal legitimacy through negative control experiments. To assess robustness beyond a single out-of-sample (OOS) window, we additionally evaluate an extended period (2024--2025), which reveals a market-regime dependency: the policy excels in volatile conditions but shows reduced alpha in trending bull markets.
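The blindfolding step described in the abstract can be illustrated with a minimal sketch. It assumes a simple alias scheme (`ASSET_000`, `ASSET_001`, ...) and whole-word text replacement; the function names, the alias format, and the sample tickers are hypothetical and are not taken from the paper, whose actual anonymization procedure may differ.

```python
import random
import re


def build_alias_map(tickers, seed=0):
    """Map each real ticker to an opaque alias (hypothetical 'ASSET_NNN' format)."""
    shuffled = sorted(tickers)
    # Shuffle so that alias order carries no alphabetical hint about the ticker.
    random.Random(seed).shuffle(shuffled)
    return {ticker: f"ASSET_{i:03d}" for i, ticker in enumerate(shuffled)}


def anonymize_text(text, alias_map, company_names):
    """Replace tickers and company names in free text with their aliases."""
    replacements = dict(alias_map)
    for ticker, name in company_names.items():
        replacements[name] = alias_map[ticker]
    # Longest keys first, so a multi-word name wins over any shorter overlapping key.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # Single pass: already-inserted aliases are never re-scanned.
    return pattern.sub(lambda m: replacements[m.group(1)], text)


alias_map = build_alias_map(["AAPL", "MSFT", "XOM"], seed=42)
names = {"AAPL": "Apple", "MSFT": "Microsoft", "XOM": "Exxon Mobil"}
blind = anonymize_text("Apple (AAPL) beat estimates while MSFT lagged.", alias_map, names)
print(blind)
```

A fixed seed keeps the mapping reproducible across runs, which matters if the same aliases must appear consistently in every prompt shown to the agents.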

Paper Structure (48 sections, 7 figures, 9 tables)


Figures (7)

  • Figure 1: Cumulative returns for the 2025 YTD OOS period. Shaded band shows $\pm$1 std across 20 seeds.
  • Figure 2: The BlindTrade Pipeline: Data anonymization, Multi-agent LLM feature generation, IC validation, SemGAT encoding, Intent-conditioned RL (PPO-DSR), and Backtesting.
  • Figure 3: Leakage audit via negative control. When predictions are randomized, IC disappears and performance collapses.
  • Figure 4: Intent probability timeline across Train/Val/OOS periods. (a) Daily intent probabilities show how the policy adapts to market conditions. (b) Intent distribution remains stable across splits, demonstrating generalization.
  • Figure 5: Intent-conditioned policy behavior. (a) Defensive mode shows higher turnover (2.9%/day) for active rebalancing. (b-c) Max weight and concentration (Effective N) differ significantly by intent (Kruskal-Wallis test, p < 0.001).
  • ...and 2 more figures
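
The negative control summarized in the Figure 3 caption (randomized predictions should lose their IC) can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it takes IC to be the daily cross-sectional Spearman rank correlation between scores and next-period returns, randomizes by permuting scores within each day, and runs on synthetic data with a planted signal. The helper names `rank_ic` and `negative_control` are hypothetical.

```python
import numpy as np


def rank_ic(scores, returns):
    """Spearman rank correlation between scores and returns (assumes no ties)."""
    score_ranks = np.argsort(np.argsort(scores))
    return_ranks = np.argsort(np.argsort(returns))
    return float(np.corrcoef(score_ranks, return_ranks)[0, 1])


def negative_control(scores, returns, n_shuffles=50, seed=0):
    """Return the real mean daily IC and the mean ICs obtained after shuffling scores."""
    rng = np.random.default_rng(seed)
    real_ic = np.mean([rank_ic(s, r) for s, r in zip(scores, returns)])
    shuffled_ics = [
        # Permuting scores within a day breaks the score-to-asset link for that day.
        np.mean([rank_ic(rng.permutation(s), r) for s, r in zip(scores, returns)])
        for _ in range(n_shuffles)
    ]
    return float(real_ic), np.asarray(shuffled_ics)


# Synthetic data (assumption): 250 days x 50 assets with a weak planted signal.
data_rng = np.random.default_rng(1)
signal = data_rng.normal(size=(250, 50))
returns = 0.1 * signal + data_rng.normal(size=(250, 50))

real_ic, null_ics = negative_control(signal, returns)
print(f"real IC = {real_ic:.3f}, shuffled IC mean = {null_ics.mean():.3f}")
```

If the real IC sits well outside the distribution of shuffled ICs, the signal depends on the correct pairing of scores with assets. That is the property a leakage audit of this kind is meant to check.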