Implementation Risk in Portfolio Backtesting: A Previously Unquantified Source of Error

Dong Yin, Takeshi Miki, Vladislav Lesnichenko, Vasyl Gural

Abstract

Portfolio backtesting is the primary tool for evaluating investment strategies before deployment, yet practitioners implicitly assume that different engines produce identical results for the same strategy. We formalise implementation risk, the systematic divergence in backtested portfolio metrics arising solely from differences in how engines implement the same logical strategy, and propose four metrics grounded in metrology to quantify it: engine sensitivity, implementation uncertainty interval, divergence amplification factor, and conclusion stability index. We execute 15 benchmark strategies through five independent open-source engines on 30 non-overlapping stratified asset buckets comprising 180 S&P 500 stocks under four transaction-cost regimes. At zero cost, all five engines agree exactly (maximum divergence 0.000%), isolating transaction-cost implementation as the sole source of disagreement. Under nonzero costs, divergence is structured and predictable (Spearman $\rho = 0.93$ with cost intensity), remaining below 0.75 percentage points for most strategies but reaching 3.71% for high-turnover rotation strategies. Source-code forensics uncovered seven previously undocumented defects across three engines, abstracted into a five-category failure-mode taxonomy. All engines agree on the sign of every performance metric (conclusion stability index = 1), so implementation risk does not alter investment decisions for the strategies studied but introduces measurable ambiguity in performance attribution. Code and benchmark data are publicly available.
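
To make the kind of quantity discussed in the abstract concrete, here is a minimal sketch of a mean pairwise relative difference in total return across engines, together with a simple sign-agreement indicator. It is an illustration only, not the paper's definitions, which are not given on this page. The function names, the normalisation by the pair's mean absolute return, the reading of "all engines agree on the sign" as an index of 1, and the sample returns are all assumptions.

```python
from itertools import combinations


def mean_pairwise_relative_divergence(returns):
    """Mean of |a - b| / ((|a| + |b|) / 2) over all engine pairs.

    The normalisation is an assumption made for this sketch.
    """
    if len(returns) < 2:
        raise ValueError("need results from at least two engines")
    diffs = []
    for a, b in combinations(returns, 2):
        scale = (abs(a) + abs(b)) / 2.0
        # Two exact zeros agree perfectly; avoid dividing by zero.
        diffs.append(0.0 if scale == 0 else abs(a - b) / scale)
    return sum(diffs) / len(diffs)


def sign_agreement(returns):
    """Return 1.0 if every engine reports the same sign, else 0.0.

    This is an assumed, simplified stand-in for a stability index.
    """
    signs = {(r > 0) - (r < 0) for r in returns}
    return 1.0 if len(signs) == 1 else 0.0


# Hypothetical total returns of one strategy from five engines (made-up numbers).
engine_returns = [0.120, 0.118, 0.121, 0.119, 0.120]
divergence = mean_pairwise_relative_divergence(engine_returns)
agreement = sign_agreement(engine_returns)
```

With five engines, `combinations` yields ten pairs, which matches the ten engine pairs shown in the divergence heatmap.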

Paper Structure (61 sections, 6 equations, 18 figures, 5 tables, 1 algorithm)

Figures (18)

  • Figure 1: Mean equity curves across 30 asset buckets for all 15 benchmark strategies. Shaded bands show $\pm 1$ standard deviation of bucket-to-bucket variation. The five engine traces per benchmark are nearly indistinguishable at this scale, consistent with high agreement for most strategies.
  • Figure 2: Divergence heatmap across all 15 benchmarks and 10 engine pairs. Colour intensity encodes the mean pairwise relative difference in total return; the three-tier structure is clearly visible, with rotation strategies (BM03, BM04, BM11) producing the largest divergences.
  • Figure 3: Mean pairwise divergence versus composite cost-intensity score. The near-monotonic gradient indicates that cost intensity is the primary predictor of engine disagreement (Spearman $\rho = 0.93$, $p < 0.001$).
  • Figure 4: Divergence decomposed by strategy complexity. Higher-turnover strategies accumulate more cost-model disagreement, consistent with the linear scaling predicted by Conjecture 1.
  • Figure 5: Anatomy of inter-engine divergence. Left: per-benchmark divergence for two representative engine pairs on log-scaled axes; the moderate Spearman correlation ($\rho = 0.60$) indicates that pairwise disagreements are only weakly coupled. Right: divergence driver fingerprint showing Spearman correlations between each driver (total cost, cost per trade, trade count, ML signal, volatility) and per-benchmark divergence for all ten engine pairs. The heterogeneous fingerprint across pairs indicates that no single engine pair captures the full spectrum of cost-model disagreement.
  • ...and 13 more figures
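
Several captions above report Spearman rank correlations between per-benchmark divergence and a driver such as cost intensity. The sketch below shows one way to compute such a coefficient using only the standard library, with average ranks assigned to ties. The helper names and the sample values are hypothetical and do not reproduce the paper's data or code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cost-intensity scores and divergences for five benchmarks.
cost_intensity = [0.2, 0.5, 0.9, 1.4, 2.0]
divergence_pct = [0.05, 0.10, 0.08, 0.60, 3.10]
rho = spearman_rho(cost_intensity, divergence_pct)
```

Because the coefficient depends only on ranks, it measures how monotonic the relationship is and says nothing about whether the scaling is linear.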

Theorems & Definitions (1)

  • Conjecture 1