Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

Barak Gahtan, Alex M. Bronstein

TL;DR

This work tackles two gaps in temporal deep learning: how to fairly evaluate models on dependent sequences and how to provide architecture-aware generalization guarantees. It introduces a fair-comparison methodology that fixes the effective sample size, shows empirically that strong temporal dependence can improve generalization when information content is matched, and develops the first architecture-aware generalization bounds for deep temporal networks on $\beta$-mixing data using a blocking, delayed-feedback framework. The theoretical results yield polynomial learnability with $\sqrt{D}$ depth scaling and a product of layer-wise norms, while the empirical findings reveal a noticeable theory–practice gap and offer practical architectural guidance. Collectively, the paper reframes temporal dependence from a learning obstacle to a potential architectural advantage under principled evaluation and theory, with implications for model design and evaluation standards in time-series forecasting and physiological signal analysis.

Abstract

Deep temporal architectures such as TCNs achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap through three contributions: introducing an evaluation methodology for temporal models, revealing surprising empirical phenomena about temporal dependence, and developing the first architecture-aware theoretical framework for dependent sequences. Fair-Comparison Methodology. We introduce evaluation protocols that fix the effective sample size $N_{\text{eff}}$ to isolate the effects of temporal structure from those of information content. Empirical Findings. Applying this methodology reveals that at $N_{\text{eff}} = 2000$, strongly dependent sequences ($\rho = 0.8$) exhibit approximately $76\%$ smaller generalization gaps than weakly dependent ones ($\rho = 0.2$), challenging the conventional view that dependence universally impedes learning. However, the observed convergence rates ($N_{\text{eff}}^{-1.21}$ to $N_{\text{eff}}^{-0.89}$) are substantially faster than the theoretical worst-case prediction ($N^{-0.5}$), revealing that temporal architectures exploit problem structure in ways current theory does not capture. Lastly, we develop the first architecture-aware generalization bounds for deep temporal models on exponentially $\beta$-mixing sequences. By embedding Golowich et al.'s i.i.d. class bound within a novel blocking scheme that partitions $N$ samples into $B \approx N/\log N$ quasi-independent blocks, we establish polynomial sample complexity under convex Lipschitz losses. The framework achieves $\sqrt{D}$ depth scaling alongside the product of layer-wise norms $R = \prod_{\ell=1}^{D} M^{(\ell)}$, avoiding exponential dependence on depth. While these bounds are conservative, they prove learnability and identify architectural scaling laws, providing worst-case baselines that highlight where future theory must improve.
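
The convergence exponents quoted in the abstract are slopes of power-law fits to the generalization gap. The following is a minimal sketch of how such an exponent could be estimated, assuming an ordinary least-squares fit in log-log space; the fitting procedure is not specified here, the function name `fit_power_law` is hypothetical, and the data are synthetic rather than the paper's measurements.

```python
import math

def fit_power_law(n_values, gaps):
    """Least-squares fit of gap ~ C * n**alpha in log-log space.

    Returns (alpha, C). Assumes all inputs are strictly positive.
    """
    xs = [math.log(n) for n in n_values]
    ys = [math.log(g) for g in gaps]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    alpha = sxy / sxx                  # slope = power-law exponent
    log_c = y_mean - alpha * x_mean    # intercept = log of the prefactor
    return alpha, math.exp(log_c)

# Synthetic, noise-free gaps generated with exponent -0.89 (illustration only).
n_eff = [500, 1000, 2000, 4000]
gaps = [3.0 * n ** -0.89 for n in n_eff]

alpha, c = fit_power_law(n_eff, gaps)
print(round(alpha, 2), round(c, 2))  # -0.89 3.0
```

An estimated exponent below $-0.5$ (steeper decay) corresponds to the situation described above, where the empirical gap shrinks faster than the worst-case rate.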

Paper Structure

This paper contains 29 sections, 4 theorems, 41 equations, 11 figures, 3 tables.

Key Result

Lemma 1

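Under the exponential $\beta$-mixing assumption, the first elements of each block are nearly independent in the following sense: the total variation distance between their joint distribution and the product of their marginals is bounded by $B\times\beta(d)$.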

Figures (11)

  • Figure 1: Illustration of the blocking mechanism. The time series is partitioned into blocks of length $d+1=4$, with first elements (blue) separated by $d+1=4$ positions (or equivalently, $d=3$ intervening positions). This spacing ensures dependence between these elements decays according to $\beta(d)$. When $d$ is chosen optimally as $\lceil\log N/c_0\rceil$, the total variation distance between the joint distribution of the first elements and the product of their marginals is bounded by $B\times\beta(d)$.
  • Figure 2: Fair comparison reveals complex scaling relationships that exceed theoretical predictions. The y-axis shows empirical generalization gap divided by theoretical bound; lower values indicate tighter bounds. Dotted lines show power-law fits ($N_{\text{eff}}^{-1.21}$ for $\rho=0.2$, $N_{\text{eff}}^{-0.89}$ for $\rho=0.8$), both substantially steeper than the predicted $N^{-0.5}$ rate (gray dashed line), revealing that temporal architectures exploit structure beyond worst-case theory. Error bars represent standard error across 12 trials (3 trials $\times$ 4 depths per condition).
  • Figure 3: Depth scaling under fair comparison shows weaker empirical dependence than theoretical $\sqrt{D}$ prediction. At $N_{\text{eff}}=2000$, generalization gaps remain relatively stable across depths, particularly for strong dependencies ($\rho=0.8$). This deviation from the $\sqrt{D}$ reference line suggests TCNs exploit temporal smoothness in AR(1) processes more effectively than worst-case analysis predicts. The high variance at $D=8$ may reflect optimization challenges for very deep networks on limited data.
  • Figure 4: PhysioNet: Empirical generalization gap vs. sequence length. The empirical gap decreases faster ($N^{-0.79}$) than the predicted theoretical rate ($N^{-1/2}$), suggesting that physiological signals contain structured regularities that enable more efficient learning than generic $\beta$-mixing processes.
  • Figure 5: PhysioNet: Empirical generalization gap vs. network depth. The empirical gaps grow approximately linearly with depth, tracking an $O(D)$ trend indicated by the dashed reference line in the legend, whereas our theory predicts $O(\sqrt{D})$ scaling. Despite this steeper-than-theoretical growth, absolute gaps remain small for practical depths, and the qualitative depth dependence is consistent across random seeds. Error bars show $\pm$ 1 s.e. over three training runs per depth.
  • ...and 6 more figures
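
The blocking mechanism in Figure 1 can be sketched in a few lines. This is an illustrative reading of the caption, not the authors' implementation: the function names are hypothetical, leftover samples that do not fill a whole block are simply dropped (an assumption), and the mixing coefficient is assumed to take the form $\beta(k) = c_1 e^{-c_0 k}$ with an illustrative constant $c_1 = 1$.

```python
import math

def blocking_layout(n_samples, c0):
    """Partition n_samples positions into blocks of length d + 1.

    d = ceil(log(N) / c0) follows the Figure 1 caption; dropping the
    incomplete trailing block is an assumption of this sketch.
    """
    d = math.ceil(math.log(n_samples) / c0)
    block_len = d + 1
    n_blocks = n_samples // block_len
    first_indices = [b * block_len for b in range(n_blocks)]
    return d, block_len, n_blocks, first_indices

def tv_bound(n_blocks, d, c0, c1=1.0):
    """B * beta(d), assuming beta(k) = c1 * exp(-c0 * k) (illustrative form)."""
    return n_blocks * c1 * math.exp(-c0 * d)

d, block_len, n_blocks, firsts = blocking_layout(n_samples=20, c0=1.0)
print(d, block_len, n_blocks, firsts)  # 3 4 5 [0, 4, 8, 12, 16]
print(tv_bound(n_blocks, d, c0=1.0) <= n_blocks / 20)  # True
```

With $N = 20$ and $c_0 = 1$ this reproduces the figure's setting of $d = 3$ and block length 4. Because $d \ge \log N / c_0$ implies $e^{-c_0 d} \le 1/N$, the quantity $B\times\beta(d)$ is at most $c_1 B/N$ under the assumed form of $\beta$.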

Theorems & Definitions (9)

  • Remark 1: Extension to Polynomial Mixing
  • Remark 2: Why Exponential Mixing Is Reasonable for ECG‑like Signals
  • Lemma 1: Blocking Lemma
  • Proof: Proof Sketch
  • Proposition 1: Block-Level Concentration
  • Proof: Proof Sketch
  • Lemma 2: TCN Rademacher Complexity via Golowich et al.
  • Theorem 1: Architecture-Aware Framework for Dependent Sequences
  • Proof: Proof Sketch
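
The two architecture-dependent quantities named in the abstract, the norm product $R = \prod_{\ell=1}^{D} M^{(\ell)}$ and the $\sqrt{D}$ depth factor, can be computed directly from per-layer norm bounds. The sketch below only evaluates these two terms; how they combine with constants and sample size in the final bound is not stated in this summary, and the function name and the example norm values are hypothetical.

```python
import math

def architecture_terms(layer_norms):
    """Return (D, R, sqrt(D)) for a list of per-layer norm bounds M^(l)."""
    depth = len(layer_norms)
    norm_product = math.prod(layer_norms)  # R = product of layer-wise norms
    return depth, norm_product, math.sqrt(depth)

# Hypothetical norm bounds for a 4-layer network (illustration only).
depth, R, sqrt_d = architecture_terms([1.5, 2.0, 0.5, 1.0])
print(depth, R, sqrt_d)  # 4 1.5 2.0
```

Note that $R$ itself can still grow or shrink quickly with depth depending on the per-layer norms; the $\sqrt{D}$ factor describes only the explicit depth dependence.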