Generalization in Representation Models via Random Matrix Theory: Application to Recurrent Networks
Yessin Moakher, Malik Tiomoko, Cosme Louart, Zhenyu Liao
TL;DR
The paper develops a unified random matrix theory framework for analyzing generalization in models that learn only a linear readout on top of fixed representations, including recurrent maps such as Echo State Networks (ESNs). In the high-dimensional regime, it yields deterministic asymptotic expressions for the out-of-sample risk, decomposes that risk into bias and variance components, and provides explicit formulas in terms of fixed-point quantities and resolvents. Specializing to linear ESNs, the authors show an equivalence to ridge regression with an exponentially memory-weighted input covariance, which reveals a bias toward recent inputs and explains the absence of double descent in this setting. Empirical results corroborate the theory: ESNs outperform ridge regression in low-data, short-memory scenarios, while ridge regression regains the advantage with more data or longer dependencies. Overall, the work offers a general theoretical framework for understanding generalization in overparameterized models with fixed representations and yields practical insights for designing recurrent representations.
Abstract
We first study the generalization error of models that use a fixed feature representation (frozen intermediate layers) followed by a trainable readout layer. This setting encompasses a range of architectures, from deep random-feature models to echo-state networks (ESNs) with recurrent dynamics. Working in the high-dimensional regime, we apply Random Matrix Theory to derive a closed-form expression for the asymptotic generalization error. We then apply this analysis to recurrent representations and obtain concise formulas that characterize their performance. Surprisingly, we show that a linear ESN is equivalent to ridge regression with an exponentially time-weighted ("memory") input covariance, revealing a clear inductive bias toward recent inputs. Experiments match predictions: ESNs win in low-sample, short-memory regimes, while ridge prevails with more data or long-range dependencies. Our methodology provides a general framework for analyzing overparameterized models and offers insights into the behavior of deep learning networks.
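To make the setting concrete, here is a minimal NumPy sketch of a linear ESN with a frozen random reservoir and a ridge-trained readout. It is an illustration under stated assumptions, not the authors' implementation: the state update x_t = W x_{t-1} + w_in u_t, the scalar input, the sizes, the spectral radius `rho`, the toy target, and the helper names `run_linear_esn` and `ridge_readout` are all hypothetical choices. The last few lines check that the final state equals the unrolled sum of W^k w_in u_{t-k} over past inputs. With spectral radius below 1, those weights eventually decay geometrically in the lag k, which is one way to picture the bias toward recent inputs.

```python
import numpy as np

def run_linear_esn(u, W, w_in):
    """Collect states x_t = W x_{t-1} + w_in * u_t for a scalar input sequence u."""
    n = W.shape[0]
    x = np.zeros(n)
    states = np.empty((len(u), n))
    for t, u_t in enumerate(u):
        x = W @ x + w_in * u_t
        states[t] = x
    return states

def ridge_readout(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Hypothetical sizes and spectral radius, chosen only for illustration.
rng = np.random.default_rng(0)
n, T, rho = 50, 200, 0.8

# Fixed (untrained) reservoir, rescaled so its spectral radius equals rho.
W = rng.standard_normal((n, n))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.standard_normal(n)

# Toy short-memory target: depends on the current and previous input only.
u = rng.standard_normal(T)
y = 0.7 * u
y[1:] += 0.3 * u[:-1]

# Only the linear readout is trained; the reservoir stays frozen.
states = run_linear_esn(u, W, w_in)
w_out = ridge_readout(states, y, lam=1e-2)

# Unroll the recursion: x_{T-1} = sum_k W^k w_in u_{T-1-k}.
unrolled = np.zeros(n)
v = w_in.copy()              # holds W^k w_in
for k in range(T):
    unrolled += v * u[T - 1 - k]
    v = W @ v
```

Because the state is linear in the input history, the ridge readout on `states` is a linear regression on lag-weighted past inputs; the exact form of the equivalent input covariance is derived in the paper and is not reproduced here.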
