NoLBERT: A No Lookahead(back) Foundational Language Model

Ali Kakhbod, Peiyao Li

TL;DR

NoLBERT addresses information leakage biases in text-based economic prediction by enforcing a timestamped pretraining window (1976–1995) and validating temporal locality, then demonstrates its practical value by constructing year-specific firm innovation networks from patent texts. Built on a DeBERTa v3 base with a custom 30k tokenizer, NoLBERT achieves strong results on NLP benchmarks while providing temporally disciplined representations for econometric inference. The paper shows that growth in a firm's innovation centrality within patent-derived networks is significantly associated with higher medium- to long-run profit growth, suggesting diffusion leverage and enhanced complementarities as mechanisms. Overall, NoLBERT offers a scalable foundation for textual econometrics that is free of lookahead and lookback bias, and illustrates how network centrality of innovations can forecast firm performance.

Abstract

We present NoLBERT, a lightweight, timestamped foundational language model for empirical research -- particularly for forecasting in economics, finance, and the social sciences. By pretraining exclusively on text from 1976 to 1995, NoLBERT avoids both lookback and lookahead biases (information leakage) that can undermine econometric inference. It exceeds domain-specific baselines on NLP benchmarks while maintaining temporal consistency. Applied to patent texts, NoLBERT enables the construction of firm-level innovation networks and shows that gains in innovation centrality predict higher long-run profit growth.

Paper Structure

This paper contains 25 sections, 6 equations, 2 figures, 8 tables.

Figures (2)

  • Figure A1: Summary statistics of PageRank and weighted-degree centrality.
  • Figure A2: Cumulative industry composition from $t$ to $2021$ of the most central firms.
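
The two centrality measures named in Figure A1, PageRank and weighted degree, can be illustrated on a toy firm network. The sketch below is a minimal illustration and not the paper's construction: the firm vectors are made-up numbers standing in for text embeddings, and the use of cosine similarity with a cutoff to define edge weights is an assumption, as are all function names and parameter values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def similarity_network(embeddings, threshold=0.0):
    """Symmetric weighted adjacency matrix (assumed construction).

    An edge i-j gets weight cosine(i, j) when that value exceeds
    `threshold`; self-loops are excluded.
    """
    n = len(embeddings)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(embeddings[i], embeddings[j])
            if s > threshold:
                adj[i][j] = adj[j][i] = s
    return adj


def weighted_degree(adj):
    """Sum of edge weights incident to each node."""
    return [sum(row) for row in adj]


def pagerank(adj, damping=0.85, tol=1e-12, max_iter=1000):
    """Weighted PageRank by power iteration.

    Nodes without edges spread their rank uniformly over all nodes.
    """
    n = len(adj)
    out = weighted_degree(adj)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        dangling = sum(rank[i] for i in range(n) if out[i] == 0.0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out[i] > 0.0:
                for j in range(n):
                    if adj[i][j] > 0.0:
                        new[j] += damping * rank[i] * adj[i][j] / out[i]
        delta = sum(abs(a - b) for a, b in zip(new, rank))
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical embeddings for four firms in one year (illustrative values only).
firm_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.1],
]
adj = similarity_network(firm_embeddings, threshold=0.0)
wdeg = weighted_degree(adj)
pr = pagerank(adj)
for firm, (d, r) in enumerate(zip(wdeg, pr)):
    print(f"firm {firm}: weighted degree = {d:.3f}, PageRank = {r:.3f}")
```

Repeating this per year on that year's embeddings would give year-specific centralities whose changes over time could then be related to firm outcomes; how the paper actually defines edges and aggregates patents to firms is not stated in this section.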