PAC-Bayes Generalisation Bounds for Dynamical Systems Including Stable RNNs

Deividas Eringis, John Leth, Zheng-Hua Tan, Rafal Wisniewski, Mihaly Petreczky

TL;DR

This work develops non-asymptotic PAC-Bayes generalisation bounds for predictors realized by discrete-time dynamical systems with hidden states, covering LTI systems and RNNs under stability constraints. By modeling the data as outputs of class $\mathcal{S}$ systems and using a two-step proof strategy, the authors derive a Catoni-like bound that holds for non-i.i.d. time-series data and decays as $O(1/\sqrt{N})$ when the hyperparameter $\lambda$ scales as $O(\sqrt{N})$, where $N$ is the dataset size. The bound explicitly involves the KL divergence between posterior and prior, the data-mixing constants, and Lipschitz properties of the loss and system dynamics, and it is computable via Monte Carlo methods. A synthetic illustration demonstrates non-vacuity for modest sample sizes and emphasizes that the bound does not grow with the number of RNN steps, a notable advantage for long sequences. Overall, the framework provides a principled, computable way to assess and control generalisation for time-series models, including stable RNNs, under realistic non-i.i.d. data assumptions.
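To see why a hyperparameter of order $\sqrt{N}$ yields an $O(1/\sqrt{N})$ rate, it may help to look at a schematic Catoni-style inequality. The form below is an assumption made purely for illustration and is not the paper's statement: the symbols $\mathcal{L}$ (generalisation loss), $\hat{\mathcal{L}}_N$ (empirical loss), $\delta$ (confidence level), $C$ and $c$ are hypothetical placeholders.

```latex
% Schematic only: C, c, \mathcal{L}, \hat{\mathcal{L}}_N, \delta are hypothetical placeholders.
\[
  \mathbb{E}_{h\sim\hat{\rho}}\,\mathcal{L}(h)
  \;\le\;
  \mathbb{E}_{h\sim\hat{\rho}}\,\hat{\mathcal{L}}_N(h)
  + \frac{1}{\lambda}\Big[ D_{\mathrm{KL}}(\hat{\rho}\,\|\,\pi) + \ln\tfrac{1}{\delta} \Big]
  + \frac{\lambda\, C}{N}
\]
% Substituting \lambda = c\sqrt{N}:
\[
  \frac{1}{c\sqrt{N}}\Big[ D_{\mathrm{KL}}(\hat{\rho}\,\|\,\pi) + \ln\tfrac{1}{\delta} \Big]
  + \frac{c\, C}{\sqrt{N}}
  \;=\; O\!\big(1/\sqrt{N}\big)
\]
```

Under this assumed form, a fixed $\lambda$ leaves a gap term of order $1/\lambda$ that does not vanish, while $\lambda$ growing faster than $\sqrt{N}$ inflates the $\lambda C/N$ term; the $\sqrt{N}$ scaling balances the two, provided the KL term stays bounded in $N$.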

Abstract

In this paper, we derive a PAC-Bayes bound on the generalisation gap in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNNs), and the motivation for this work was its application to RNNs. To obtain the results, we impose stability constraints on the allowed models. Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed as conditions on the weights. We assume that the processes involved are essentially bounded and that the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution and on the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds, the derived bound holds for non-i.i.d. data (time series) and does not grow with the number of steps of the RNN.
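As a rough illustration of how a stability condition can be phrased in terms of the weights, the sketch below checks a standard sufficient condition for a tanh RNN to be a contraction in its state: a spectral norm of the recurrent matrix below 1. This is an assumption chosen for illustration and may differ from the exact condition used in the paper; the update rule $x_{t+1} = \tanh(W_{rec} x_t + W_{in} u_t)$ and all function names are hypothetical.

```python
import numpy as np

# Assumed model (not taken from the paper): x_{t+1} = tanh(W_rec @ x_t + W_in @ u_t).
# Because tanh is 1-Lipschitz, ||W_rec||_2 < 1 is a sufficient (not necessary)
# condition for the state map to be a contraction, uniformly in the input.

def spectral_norm(W_rec):
    """Largest singular value of the recurrent weight matrix."""
    return float(np.linalg.norm(W_rec, ord=2))

def is_contractive_tanh_rnn(W_rec):
    """True if the sufficient contraction condition ||W_rec||_2 < 1 holds."""
    return spectral_norm(W_rec) < 1.0

def rescale_to_contractive(W_rec, target=0.9):
    """Shrink W_rec so that its spectral norm is at most `target`."""
    norm = spectral_norm(W_rec)
    if norm <= target:
        return W_rec.copy()
    return W_rec * (target / norm)

def state_gap(W_rec, W_in, inputs, x0_a, x0_b):
    """Distance between two state trajectories driven by the same inputs."""
    xa, xb = x0_a.copy(), x0_b.copy()
    for u in inputs:
        xa = np.tanh(W_rec @ xa + W_in @ u)
        xb = np.tanh(W_rec @ xb + W_in @ u)
    return float(np.linalg.norm(xa - xb))

rng = np.random.default_rng(0)
W_rec = rescale_to_contractive(rng.normal(size=(4, 4)), target=0.9)
W_in = rng.normal(size=(4, 2))
inputs = rng.normal(size=(50, 2))
x0_a, x0_b = np.ones(4), -np.ones(4)

print("contractive:", is_contractive_tanh_rnn(W_rec))
print("state gap after 50 steps:", state_gap(W_rec, W_in, inputs, x0_a, x0_b))
```

Under this condition the influence of the initial state decays geometrically, which matches the intuition that a bound for stable systems need not grow with the number of RNN steps.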

Paper Structure (16 sections, 18 theorems, 164 equations, 2 figures, 1 table)


Key Result

Theorem 2.1

Let $\pi$ be a probability density on $\mathcal{H}$ and let $\mathcal{M}_{\pi}$ denote the set of all probability densities for which the corresponding probability measures are absolutely continuous w.r.t. the probability measure defined by $\pi$. There exist constants $G_1$ and $G_2$, which depend […] where $D_{\mathrm{KL}}(\hat{\rho}\| \pi)\triangleq \mathbb{E}_{h\sim\hat{\rho}} \ln \frac{\hat{\rho}(h)}{\pi(h)}$ […]
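The KL term in the statement above is an expectation under $\hat{\rho}$, so it can be estimated by sampling. The following is a minimal sketch, assuming diagonal Gaussian densities for both $\hat{\rho}$ and $\pi$ (an assumption for illustration, not a choice confirmed by the source); the function names are hypothetical. The closed-form Gaussian KL is included only as a sanity check on the estimator.

```python
import numpy as np

def log_diag_gauss(h, mean, std):
    """Log-density of N(mean, diag(std**2)) evaluated at each row of h."""
    z = (h - mean) / std
    return -0.5 * np.sum(z**2 + 2.0 * np.log(std) + np.log(2.0 * np.pi), axis=-1)

def kl_monte_carlo(post_mean, post_std, prior_mean, prior_std,
                   n_samples=200_000, seed=0):
    """Estimate E_{h ~ posterior}[ln posterior(h) - ln prior(h)] by sampling."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_samples, post_mean.size))
    h = post_mean + post_std * noise  # samples from the assumed Gaussian posterior
    log_ratio = (log_diag_gauss(h, post_mean, post_std)
                 - log_diag_gauss(h, prior_mean, prior_std))
    return float(np.mean(log_ratio))

def kl_closed_form(post_mean, post_std, prior_mean, prior_std):
    """Exact KL divergence between two diagonal Gaussians."""
    var_ratio = (post_std / prior_std) ** 2
    mean_term = ((post_mean - prior_mean) / prior_std) ** 2
    return float(0.5 * np.sum(var_ratio + mean_term - 1.0 - np.log(var_ratio)))

# Hypothetical posterior and prior over a 3-dimensional parameter vector h.
post_mean = np.array([0.5, -0.2, 0.1])
post_std = np.array([0.3, 0.5, 0.8])
prior_mean = np.zeros(3)
prior_std = np.ones(3)

print("Monte Carlo KL:", kl_monte_carlo(post_mean, post_std, prior_mean, prior_std))
print("closed-form KL:", kl_closed_form(post_mean, post_std, prior_mean, prior_std))
```

For posteriors without a closed-form KL, only the sampling route remains available, and the number of samples controls the accuracy of the estimate.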

Figures (2)

  • Figure 1: The main theorem is used to compute the results of the illustrative example, evaluated on 10 different realisations of data.
  • Figure 2: Dependence of the Lemmas (and proofs) in the Appendix. For example, Theorem 5.1 directly depends on Lemmas 2.1, A.2, and A.3.

Theorems & Definitions (40)

  • Theorem 2.1: Informal theorem
  • Remark 2.1: Asymptotic properties: $O(1/\sqrt{N})$ bound
  • Remark 2.2: Intuition behind the constants
  • Lemma 2.1: Theorem 3 of [nips-16]
  • Definition 3.1: UEC and steady-state state and output trajectories [PavlocTac2011]
  • Definition 3.2: Class $\mathcal{S}$ system
  • Remark 3.1: Role of constants in robustness
  • Lemma 4.1
  • Remark 4.1
  • Theorem 5.1
  • ...and 30 more