All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting

Zeyu Zhang; Ryan Chen; Bradly C. Stadie


TL;DR

TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.

Abstract

To evaluate whether LLMs can accurately predict future events, we need the ability to \textit{backtest} them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this \emph{temporal knowledge leakage}. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies \textit{Shapley values} to measure each claim's contribution to the prediction. This yields the \textbf{Shapley}-weighted \textbf{D}ecision-\textbf{C}ritical \textbf{L}eakage \textbf{R}ate (\textbf{Shapley-DCLR}), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose \textbf{Time}-\textbf{S}upervised \textbf{P}rediction with \textbf{E}xtracted \textbf{C}laims (\textbf{TimeSPEC}), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination -- producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.


Paper Structure (71 sections, 8 equations, 3 figures, 6 tables)


Introduction
Related Work
Temporal Leakage and Data Contamination.
LLM-based Forecasting and Backtesting.
Attribution and Rationale Evaluation.
Preliminaries
Problem Setting and Temporal Leakage
Claim Taxonomy
Temporal Leakage Evaluation
Phase 1: Claim Extraction
Phase 2: Shapley Value Computation
Phase 3: Leakage Detection
Phase 4: Leakage Metrics
TimeSPEC Architecture
Generator
...and 56 more sections

Figures (3)

Figure 1: Overview of the temporal leakage evaluation pipeline. Given a prediction rationale $R$, reference time $t_{\text{ref}}$, and task context, Phase 1 extracts atomic claims $\{c_i\}$ and assigns each a category label from our taxonomy. Phases 2 and 3 execute in parallel: Phase 2 computes Shapley values $\{\phi_i\}$ quantifying each claim's contribution to the prediction via Monte Carlo sampling; Phase 3 determines leakage indicators $\{\ell_i\}$ using category-based rules. Phase 4 aggregates these outputs into two complementary metrics: Overall Leakage Rate (OLR), which treats all claims equally, and Shapley-weighted Decision-Critical Leakage Rate (Shapley-DCLR), which weights leakage by each claim's predictive importance.
Figure 2: Architecture of TimeSPEC. Phase 1 (Generator) performs a temporally filtered search that retrieves only documents published before $t_{\text{ref}}$, producing a draft prediction with a rationale. Phase 2 (Supervisor) extracts claims, assigns category labels, and applies external search to detect temporal violations. If violations exist, Phase 3 (Regenerator) produces an improved prediction using validated claims and diverse new queries, followed by Phase 4 (Resupervisor) validation. Phase 5 (Aggregator) synthesizes the final prediction from all validated claims $\mathcal{C}_{\text{valid}}^{(1)} \cup \mathcal{C}_{\text{valid}}^{(2)}$ using category-aware reasoning under a closed-world constraint.
Figure 3: Two-dimensional evaluation of prediction agents across three tasks. X-axis: transformed performance (1-BS, 1-RE, $\rho$; higher = better prediction). Y-axis: Shapley-DCLR (lower = less leakage). The ideal region is the lower-right: accurate predictions without relying on future information. Left (Legal): All agents achieve high performance and near-zero leakage. Center (Salary): TimeSPEC achieves 75% leakage reduction while maintaining reasonable performance. Right (Stock): Baselines achieve high performance but with substantial leakage; TimeSPEC's lower performance with near-zero leakage reflects honest inference from pre-cutoff data. Markers: $\blacksquare$ Superforecasting, $\blacktriangle$ Temporal Hint, $\bullet$ TimeSPEC (Ours).
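The evaluation pipeline in Figure 1 (Monte Carlo Shapley values $\{\phi_i\}$, leakage indicators $\{\ell_i\}$, and the OLR and Shapley-DCLR aggregates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the permutation-sampling estimator, the toy additive value function, and in particular the choice to weight leaked claims by $|\phi_i|$ normalized by $\sum_j |\phi_j|$ are all assumptions made for illustration, since the exact formulas are not given in this excerpt.

```python
import random
from typing import Callable, FrozenSet, List, Sequence


def monte_carlo_shapley(
    n_claims: int,
    value_fn: Callable[[FrozenSet[int]], float],
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled claim orderings (permutation sampling)."""
    rng = random.Random(seed)
    phi = [0.0] * n_claims
    for _ in range(n_samples):
        order = list(range(n_claims))
        rng.shuffle(order)
        included: set = set()
        prev = value_fn(frozenset(included))
        for i in order:
            included.add(i)
            cur = value_fn(frozenset(included))
            phi[i] += cur - prev  # marginal contribution of claim i
            prev = cur
    return [p / n_samples for p in phi]


def overall_leakage_rate(leaks: Sequence[int]) -> float:
    """OLR: fraction of claims flagged as leaked, all claims weighted equally."""
    return sum(leaks) / len(leaks) if leaks else 0.0


def shapley_dclr(phi: Sequence[float], leaks: Sequence[int]) -> float:
    """Shapley-weighted leakage rate (ASSUMED form): share of total
    absolute Shapley mass carried by leaked claims."""
    total = sum(abs(p) for p in phi)
    if total == 0:
        return 0.0
    return sum(abs(p) for p, l in zip(phi, leaks) if l) / total


# Hypothetical toy setup: three claims whose contributions simply add up.
# In a real pipeline value_fn would query the model on a subset of claims.
weights = [0.6, 0.3, 0.1]


def toy_value(subset: FrozenSet[int]) -> float:
    return sum(weights[i] for i in subset)


leaks = [1, 0, 0]  # only the most influential claim is flagged as leaked
phi = monte_carlo_shapley(len(weights), toy_value, n_samples=50)
olr = overall_leakage_rate(leaks)
dclr = shapley_dclr(phi, leaks)
print(f"OLR={olr:.3f}  Shapley-DCLR={dclr:.3f}")
```

In this toy case one of three claims is leaked, so OLR is about 0.33, while the importance-weighted rate is about 0.6 because the leaked claim carries most of the Shapley mass. This is the sense in which the two metrics are complementary: OLR treats all leaks equally, whereas the weighted rate emphasizes leaks that drive the prediction.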
