A Brief Survey on the Approximation Theory for Sequence Modelling

Haotian Jiang, Qianxiao Li, Zhong Li, Shida Wang

TL;DR

The survey develops a unified view of sequence modelling through classical approximation theory, recasting architectures such as RNNs, temporal CNNs, encoder–decoder models, and Transformers as hypothesis spaces for sequence-to-sequence functionals. It delineates universal approximation (density) results and, for several linear or memory-decaying settings, Jackson-type (rate) and Bernstein-type (inverse) results, highlighting how memory properties shape approximation efficiency. Key findings include universal approximation for RNNs under fading memory, Jackson-type rates tied to memory decay for linear RNNs, and a comparison of RNNs and CNNs based on memory structure; rate results for Transformers remain largely open. The article also outlines practical goals (model selection and simplification) and mathematical directions (defining sequence-approximation spaces and exploring optimization and generalization in the sequential regime). Together, these results provide a blueprint for developing a principled theory of sequence modelling and for guiding architecture choice in practice.
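
To make the link between linear RNNs and memory decay concrete, the following is a minimal sketch and is not taken from the survey itself. It assumes a discrete-time, scalar-input linear RNN h_{k+1} = W h_k + U x_k with readout y_k = c^T h_{k+1} and h_0 = 0; the names `linear_rnn_outputs` and `memory_kernel`, and the particular values of `W`, `U`, and `c`, are illustrative assumptions. Under these assumptions the output is a causal convolution of the input with the kernel rho(s) = c^T W^s U, which decays geometrically when the spectral radius of W is below 1.

```python
import numpy as np


def linear_rnn_outputs(W, U, c, x):
    """Run h_{k+1} = W h_k + U x_k from h_0 = 0 and return y_k = c^T h_{k+1}."""
    h = np.zeros(W.shape[0])
    ys = []
    for x_k in x:
        h = W @ h + U * x_k  # x_k is a scalar input, U a vector
        ys.append(c @ h)
    return np.array(ys)


def memory_kernel(W, U, c, length):
    """Return rho(s) = c^T W^s U for s = 0, ..., length - 1."""
    rho = []
    v = U.copy()
    for _ in range(length):
        rho.append(c @ v)
        v = W @ v  # advance from W^s U to W^{s+1} U
    return np.array(rho)


# Hypothetical example: a diagonal W with spectral radius 0.9 < 1.
W = np.diag([0.9, 0.7, 0.5, 0.3])
U = np.ones(4)
c = np.array([1.0, -0.5, 0.25, 0.1])

rng = np.random.default_rng(0)
x = rng.standard_normal(50)

y = linear_rnn_outputs(W, U, c, x)
rho = memory_kernel(W, U, c, len(x))

# The RNN output equals the causal convolution of the input with rho.
y_conv = np.convolve(rho, x)[: len(x)]
print(np.allclose(y, y_conv))

# The kernel magnitude shrinks geometrically with the lag s.
print(abs(rho[0]), abs(rho[-1]))
```

In this sketch the slowest-decaying mode of `W` sets the tail of `rho`, which is one way to see why targets with slowly decaying memory are harder for such a model to capture with a small hidden state.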

Abstract

We survey current developments in the approximation theory of sequence modelling in machine learning. Particular emphasis is placed on classifying existing results for various model architectures through the lens of classical approximation paradigms, and the insights one can gain from these results. We also outline some future research directions towards building a theory of sequence modelling.

Paper Structure (20 sections, 58 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Schematic illustration of a high-rank vs. low-rank sequential relationship under the temporal product structure. A dataset of input sequences (left) is fed into a functional sequence, producing the corresponding output sequences (right). The top-right (resp. bottom-right) plot shows the output sequences of a high-rank (resp. low-rank) relationship. Observe that the high-rank relationship yields a complex and input-sensitive temporal structure. In contrast, the outputs of the low-rank relationship exhibit greater regularity, with only macroscopic structures present. It is precisely the latter that REncDec is adapted to model.