Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology
Barak Gahtan, Alex M. Bronstein
TL;DR
This work tackles two gaps in temporal deep learning: how to fairly evaluate models on dependent sequences and how to provide architecture-aware generalization guarantees. It introduces a fair-comparison methodology that fixes the effective sample size, empirically shows that strong temporal dependence can enhance generalization under matched information content, and develops the first architecture-aware generalization bounds for deep temporal networks on beta-mixing data using a blocking, delayed-feedback framework. The theoretical results yield polynomial learnability with depth scaling ~√D and a product of layer-norms, while empirical findings reveal a noticeable theory–practice gap and practical architectural guidance. Collectively, the paper reframes temporal dependencies from a learning obstacle to a potential architectural advantage under principled evaluation and theory, with implications for model design and evaluation standards in time-series forecasting and physiological signal analysis.
Abstract
Deep temporal architectures such as TCNs achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap through three contributions: an evaluation methodology for temporal models, surprising empirical findings about temporal dependence, and the first architecture-aware theoretical framework for dependent sequences. Fair-Comparison Methodology. We introduce evaluation protocols that fix the effective sample size $N_{\text{eff}}$ to isolate the effect of temporal structure from that of information content. Empirical Findings. Applying this methodology at $N_{\text{eff}} = 2000$, we find that strongly dependent sequences ($ρ = 0.8$) exhibit approximately $76\%$ smaller generalization gaps than weakly dependent ones ($ρ = 0.2$), challenging the conventional view that dependence universally impedes learning. Moreover, observed convergence rates ($N_{\text{eff}}^{-1.21}$ to $N_{\text{eff}}^{-0.89}$) are significantly faster than the theoretical worst-case prediction ($N^{-0.5}$), revealing that temporal architectures exploit problem structure in ways current theory does not capture. Theoretical Framework. We develop the first architecture-aware generalization bounds for deep temporal models on exponentially $β$-mixing sequences. By embedding Golowich et al.'s i.i.d. class bound within a novel blocking scheme that partitions $N$ samples into $B \approx N/\log N$ quasi-independent blocks, we establish polynomial sample complexity under convex Lipschitz losses. The resulting bound scales with depth as $\sqrt{D}$ and with the product of layer-wise norms $R = \prod_{\ell=1}^{D} M^{(\ell)}$, avoiding exponential dependence on depth. While these bounds are conservative, they prove learnability and identify architectural scaling laws, providing worst-case baselines that highlight where future theory must improve.
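To make the two quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes blocks are contiguous, have length roughly $\log N$ (natural logarithm), and have no gaps between them; the actual blocking scheme and the constants in the bound are not specified here. The names `make_blocks` and `depth_norm_factor` are hypothetical and introduced only for illustration.

```python
import math


def make_blocks(n_samples):
    """Partition indices 0..n_samples-1 into contiguous blocks.

    Assumption: block length is round(log N), which gives roughly
    N / log N blocks. The real scheme may insert gaps between blocks.
    """
    block_len = max(1, round(math.log(n_samples)))
    return [
        list(range(start, min(start + block_len, n_samples)))
        for start in range(0, n_samples, block_len)
    ]


def depth_norm_factor(layer_norms):
    """Return sqrt(D) * R, where R is the product of per-layer norm bounds.

    This is only the architecture-dependent scaling term named in the
    abstract, not the full generalization bound.
    """
    depth = len(layer_norms)
    product_of_norms = math.prod(layer_norms)
    return math.sqrt(depth) * product_of_norms


blocks = make_blocks(2000)
print(len(blocks), len(blocks[0]))           # number of blocks, block length
print(depth_norm_factor([2.0, 2.0, 2.0, 2.0]))
```

With $N = 2000$ the sketch yields blocks of length 8, so the block count is close to $N/\log N \approx 263$. The second function shows why the scaling matters: for fixed per-layer norms the factor grows only as $\sqrt{D}$ through the depth term, while the product $R$ is governed entirely by the layer norms themselves.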
