SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

Yuchen Cao, Hanlin Zhang, Jacky Wai Keung, Yang Chen, Linqi Song

Abstract

Large language models (LLMs) are increasingly used as quantitative research copilots that translate natural-language strategy specifications into executable trading code. Yet most existing evaluations either focus on static financial knowledge or summarize performance with a single profitability metric, leaving a gap in benchmarking strategy-to-code trading systems as governed, auditable software. We introduce SysTradeBench (SysTB), an iterative build-test-patch benchmark that evaluates LLM-generated trading systems under drift-aware diagnostics. Given a standardized Base Strategy Doc and frozen semantics, each model must produce (i) a strategy card, (ii) executable code, and (iii) mandatory audit logs. A sandboxed harness runs determinism and anti-leakage checks, detects rule drift across iterations, and returns evidence bundles to support constrained patches. SysTradeBench reports multi-dimensional scorecards covering spec fidelity, risk discipline, reliability, and out-of-sample robustness indicators, together with cost-effectiveness signals. We evaluate 17 models across 12 strategies. Top models achieve validity above 91.7% with strong aggregate scores, but evidence-driven iteration also induces code convergence by Iter2. These findings suggest that LLM iteration complements rather than replaces governance by human quantitative researchers: LLMs excel at rapid prototyping and shallow bug fixes, while human oversight remains essential for critical strategies that require solution diversity and ensemble robustness.
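As a rough illustration of the build-test-patch loop described above, the sketch below shows one way a harness-driven iteration could be wired together: the model produces the three required artifacts, the sandbox returns an evidence bundle, and the model patches against that evidence until the checks pass or the iteration budget is exhausted. All names and types here (Artifacts, EvidenceBundle, generate_artifacts, run_sandbox_checks, build_test_patch) are hypothetical stand-ins for exposition, not SysTradeBench's actual interfaces.

# Hypothetical sketch of a build-test-patch loop of the kind the abstract
# describes. All names below are illustrative assumptions, not the
# benchmark's real API.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Artifacts:
    strategy_card: str                              # (i) strategy card
    code: str                                       # (ii) executable trading code
    audit_log: list = field(default_factory=list)   # (iii) mandatory audit logs

@dataclass
class EvidenceBundle:
    determinism_ok: bool    # result of determinism checks
    leakage_flags: list     # anti-leakage findings
    drift_report: dict      # rule drift detected across iterations
    failing_checks: list    # evidence used to constrain the next patch

def generate_artifacts(spec: str, evidence: Optional[EvidenceBundle]) -> Artifacts:
    # Placeholder for the LLM call; a real harness would prompt the model with
    # the Base Strategy Doc plus the previous iteration's evidence bundle.
    return Artifacts(strategy_card=f"card for: {spec}", code="# strategy code")

def run_sandbox_checks(artifacts: Artifacts) -> EvidenceBundle:
    # Placeholder for the sandboxed harness (determinism, anti-leakage, drift).
    return EvidenceBundle(determinism_ok=True, leakage_flags=[],
                          drift_report={}, failing_checks=[])

def build_test_patch(spec: str, max_iters: int = 4) -> Artifacts:
    artifacts = generate_artifacts(spec, None)           # Iter0: initial build
    for _ in range(max_iters):
        evidence = run_sandbox_checks(artifacts)         # test in the sandbox
        if evidence.determinism_ok and not evidence.failing_checks:
            break                                        # all checks pass; stop iterating
        artifacts = generate_artifacts(spec, evidence)   # constrained patch against the evidence
    return artifacts

if __name__ == "__main__":
    print(build_test_patch("Bollinger mean-reversion Base Strategy Doc").strategy_card)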

Paper Structure

This paper contains 62 sections, 17 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: System Architecture and Iterative Workflow of SysTradeBench
  • Figure 2: RQ3: Learning curves for Bollinger Mean Reversion strategy (3 top models × 4 iterations). (a) D1–D4 evolution. (b) OOS return trajectory showing Iter2 convergence trap (-4.6%) and Iter3 multi-objective recovery (+22.2%). (c) Average dimension scores (Iter1 vs Iter2). (d) Iteration gains showing diminishing returns (Iter1→2: +0.08 avg).
  • Figure 3: RQ4: Token usage heatmap (3 models × 4 iterations). Request tokens stabilize after Iter0; response tokens shrink as models generate patches rather than full code.
  • Figure 4: Illustrative Example: Code Quality Issues Across Five LLMs
  • Figure 5: Iter0 cross-evaluation heatmap.