TrustTrade: Human-Inspired Selective Consensus Reduces Decision Uncertainty in LLM Trading Agents

Minghan Li; Rachel Gonsalves; Weiyue Li; Sunghoon Yoon; Mengyu Wang

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in financial trading. However, they often exhibit a hazardous behavioral bias that we term uniform trust, whereby retrieved information is implicitly assumed to be factual and heterogeneous sources are treated as equally informative. This assumption stands in sharp contrast to human decision-making, which relies on selective filtering, cross-validation, and experience-driven weighting of information sources. As a result, LLM-based trading systems are particularly vulnerable to multi-source noise and misinformation, amplifying factual hallucinations and leading to unstable risk-return performance. To bridge this behavioral gap, we introduce TrustTrade (Trust-Rectified Unified Selective Trader), a multi-agent selective consensus framework inspired by human epistemic heuristics. TrustTrade replaces uniform trust with cross-agent consistency by aggregating information from multiple independent LLM agents and dynamically weighting signals based on their semantic and numerical agreement. Consistent signals are prioritized, while divergent, weakly grounded, or temporally inconsistent inputs are selectively discounted. To further stabilize decision-making, TrustTrade incorporates deterministic temporal signals as reproducible anchors and a reflective memory mechanism that adapts risk preferences at test time without additional training. Together, these components suppress noise amplification and hallucination-driven volatility, yielding more stable and risk-aware trading behavior. Across controlled backtesting in high-noise market environments (2024 Q1 and 2026 Q1), the proposed TrustTrade calibrates LLM trading behavior from extreme risk-return regimes toward a human-aligned, mid-risk and mid-return profile.
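The abstract describes selective consensus only in words, so a small illustration may help. The following is a minimal sketch, not the authors' method: it assumes each agent emits a scalar signal in [-1, 1] (negative for sell, positive for buy) and down-weights signals by their distance from the cross-agent median. The function names `consensus_weights` and `aggregate`, the exponential weighting rule, and the temperature `tau` are all hypothetical choices made for illustration.

```python
import math
from statistics import median


def consensus_weights(signals, tau=0.25):
    """Return normalized weights that shrink as a signal departs from the median.

    Assumption: agreement is measured as absolute distance to the cross-agent
    median; the paper's actual semantic/numerical agreement score may differ.
    """
    center = median(signals)
    raw = [math.exp(-abs(s - center) / tau) for s in signals]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate(signals, tau=0.25):
    """Consensus-weighted average of the agent signals."""
    weights = consensus_weights(signals, tau)
    return sum(w * s for w, s in zip(weights, signals))


# Three agents roughly agree on a buy; one agent diverges sharply.
signals = [0.6, 0.5, 0.55, -0.9]
weights = consensus_weights(signals)
score = aggregate(signals)
uniform_score = sum(signals) / len(signals)  # "uniform trust" baseline
```

Under uniform trust, the single divergent signal drags the plain average toward zero; with agreement-based weights, the divergent agent receives the smallest weight and the aggregate stays close to the majority view. A smaller `tau` discounts disagreement more aggressively.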

Paper Structure (12 sections, 8 equations, 7 figures)

Results
Discussion
Methods
Data availability
Code availability

Figures (7)

Figure 1: Diagnosing instability in LLM-based trading agents (with GPT-4o-mini). This figure illustrates how data sources, reasoning depth, and allocation regimes jointly contribute to the instability and risk–return profiles of LLM trading agents. a, Schematic of a standard LLM-based trading pipeline [xiao2025tradingagents] with increasing reasoning depth, progressing from Analysts to Researcher, Trader, and Risk Manager. b–c, Cumulative returns and maximum drawdown across LLM agents under different data-source and reasoning-depth ablations. Return and risk vary across data sources, transitioning from high-risk/high-return to more balanced trade-offs with increasing reasoning depth. d–e, Risk–return heterogeneity across human and LLM trading agents under different allocation regimes. Average cumulative return (CR) and maximum drawdown (MDD) across stocks, with API cost per stock-day shown on the right axis. Full-allocation agents earn higher returns but suffer larger drawdowns than partial-allocation agents and human annotators. f–h, Stock price dynamics during 2024 Q1. Note that all reported results are averaged over these three stocks (AAPL, GOOG, NVDA) during 2024 Q1.
Figure 2: Behavioral signatures of human trading. a, Demographic profile of human annotators (n=19). b–d, Human trading outcomes across stocks, showing moderate cumulative returns, tightly controlled drawdowns, and near-neutral Sharpe ratios, with consistent risk exposure across assets. e–f, Risk–return and volatility profiles illustrate that human annotators exhibit substantially greater dispersion across assets. g–h, Selective information weighting by human annotators: time allocation, confidence weighting, and post-hoc influence/reliability are aligned, and the combined influence–reliability score emphasizes temporally grounded signals (price trends and market indices) over narrative-driven inputs (news and sentiment). i, Decision convergence across sequential information stages for human annotators and nine LLM-based traders under full- and partial-allocation settings. Human annotators show consistently high convergence to the final action across stages, whereas LLM traders exhibit lower and more variable convergence.
Figure 3: Human-aligned trading behavior induced by multi-agent consensus filtering, temporal signals, and a memory bank with long/short-term decision reflection. a, Overview of the proposed TrustTrade framework: multiple agents collect information from diverse sources, a credibility scorer filters for high-consensus evidence, and the resulting decision is used to update a memory bank with both short- and long-term reflections. b, Decision convergence across sequential stages, comparing human annotators, baseline LLM traders, and the proposed TrustTrade. c, Risk–return trade-off averaged across stocks; the shaded ellipse marks the human-aligned preference region defined by the standard errors of human-annotator cumulative return (CR) and maximum drawdown (MDD). d, Newly introduced temporal-signal summary reporting deterministic, price-derived trends and indicators to reduce hallucination and improve decision reliability.
Figure 4: Comprehensive backtesting comparison across all baselines over the 2024 Q1 and 2026 Q1 trading periods. a & b, Risk–return trade-off averaged across NVDA, AAPL, and GOOG in 2024 Q1. Each point represents an agent configuration, plotted by average cumulative return (CR) against average maximum drawdown (MDD) or average Sharpe ratio (SR). Human annotators (yellow star) define a human-aligned risk–return preference region (shaded ellipse). Full-allocation LLM agents (blue dots) achieve high returns but incur substantial drawdowns, whereas partial-allocation LLMs (red squares) reduce risk at the cost of diminished returns. TrustTrade builds on GPT-4o-mini and Grok-4 under partial allocation, and we compare two variants: without memory and reflection, the method achieves higher returns than human annotators at comparable maximum drawdown (approaching GPT-5 performance), while adding memory and reflection slightly reduces returns but further lowers risk. The Pareto frontier and linear trade-off fit are shown for reference. c & d, Day-by-day backtesting risk–return trade-off across NVDA, AAPL, and GOOG during 2026 Q1, comparing rule-based baselines, single-LLM traders, and our multi-LLM framework. TrustTrade achieves a substantially improved risk–return balance, with higher returns and lower risk than the comparison methods.
Figure 5: Daily real-time trading performance on AAPL, GOOG, and NVDA during 2026 Q1. This figure reports day-by-day results in a forward-time setting to reduce potential leakage from earlier market-period evaluation. Rule-based baselines show return swings that closely track price volatility, while pure LLM traders exhibit comparatively unstable behaviors and outcomes. By integrating multi-agent information with selective consensus, TrustTrade improves return performance with more stable trading trajectories.
...and 2 more figures
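
The captions above report cumulative return (CR) and maximum drawdown (MDD) without stating formulas. As a rough guide, the sketch below computes both from a daily portfolio-value series using the standard textbook definitions; whether the paper uses exactly these conventions is an assumption, and the function names and sample values are hypothetical.

```python
def cumulative_return(equity):
    """Fractional gain from the first to the last portfolio value."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                     # highest value seen so far
        worst = max(worst, (peak - value) / peak)   # decline from that peak
    return worst


# Hypothetical daily portfolio values, for illustration only.
equity = [100.0, 110.0, 99.0, 120.0, 108.0]
cr = cumulative_return(equity)
mdd = max_drawdown(equity)
```

In this toy series the portfolio ends 8% above its starting value, while the deepest decline from a running peak is 10%, which shows how two strategies with similar CR can still differ sharply in MDD.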
