Instruction Tuning Chronologically Consistent Language Models
Songrun He, Linying Lv, Asaf Manela, Jimmy Wu
TL;DR
Problem: lookahead bias in LLM predictions arises when training data include information dated after the knowledge cutoff $\tau$.
Approach: ChronoGPT-Instruct enforces a leakage-free regime via a two-stage pipeline (pretraining followed by instruction fine-tuning) and a formal independence contract, requiring $\frac{q_{T|D}(t_r)}{q_T(t_r)}=1$ for all $r$.
Data and evaluation: models are trained on temporally filtered corpora up to $\tau$ and evaluated on post-cutoff tasks with strict temporal separation; analyses include instruction-following benchmarks, chronology validation, and prompt-based trading tests using firm-specific headlines from Dow Jones Newswire (2007–2023) merged with CRSP returns.
Findings: pre-cutoff predictions are strong within leakage-free bounds; post-cutoff leakage is not detected; even smaller chronologically constrained models retain a substantial portion (roughly $54\%$–$62\%$) of the apparent return predictability of larger leakage-prone models, which supports using the framework as a conservative benchmark.
Significance: the framework provides a transparent, reproducible platform for robustness tests and indicates how much predictive performance is attributable to training leakage versus genuine temporal signal.
Abstract
We introduce a family of chronologically consistent, instruction-tuned large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features give researchers an easy-to-use generative AI tool, free of lookahead bias, for a wide range of prediction tasks.
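The temporal separation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Document` record, the `split_by_cutoff` helper, and the choice to treat the cutoff as exclusive (training keeps only documents dated strictly before it) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """Hypothetical corpus record: raw text plus its publication date."""
    text: str
    published: date


def split_by_cutoff(docs, cutoff):
    """Split documents around a knowledge-cutoff date.

    Assumption: the cutoff is exclusive, so the training set holds only
    documents published strictly before `cutoff`; everything else is
    reserved for post-cutoff evaluation.
    """
    train = [d for d in docs if d.published < cutoff]
    held_out = [d for d in docs if d.published >= cutoff]
    return train, held_out


# Toy corpus with a cutoff of 2020-01-01 (illustrative values only).
corpus = [
    Document("pre-cutoff headline", date(2019, 12, 31)),
    Document("cutoff-day headline", date(2020, 1, 1)),
    Document("post-cutoff headline", date(2021, 5, 1)),
]
train_docs, eval_docs = split_by_cutoff(corpus, date(2020, 1, 1))
```

Under this convention a model trained on `train_docs` has seen nothing dated on or after the cutoff, so any forecast it makes on `eval_docs` cannot draw on leaked future information from the training text.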
