
Dynamic Universal Approximation Theory: The Basic Theory for Deep Learning-Based Computer Vision Models

Wei Wang, Qing Li

TL;DR

This work addresses the lack of a solid theoretical basis for CNNs and ViTs in computer vision by introducing Dynamic Universal Approximation Theory (DUAT), an input-dependent extension of the Universal Approximation Theorem (UAT). Using the Matrix-Vector Method, the authors recast core CV operations, namely convolution, pooling, multi-head attention (MHA), and linear maps, into a unified matrix-vector form and show that both residual-based CNNs and Vision Transformers (ViTs) fall under the DUAT class. They explain why CNNs require deep architectures, why residual connections confer superior generalization, and how ViTs differ from CNNs in their parameter modulation while sharing a common DUAT foundation. The framework offers a theoretical lens for designing CV models and suggests cognitive parallels with human vision, positing that dynamic adaptation under DUAT may underpin robust perception in both artificial and biological systems.

Abstract

Computer vision (CV) is one of the most crucial fields in artificial intelligence. In recent years, a variety of deep learning models based on convolutional neural networks (CNNs) and Transformers have been designed to tackle diverse problems in CV. These algorithms have found practical applications in areas such as robotics and facial recognition. Despite the increasing power of current CV models, several fundamental questions remain unresolved: Why do CNNs require deep layers? What ensures the generalization ability of CNNs? Why do residual-based networks outperform fully convolutional networks like VGG? What is the fundamental difference between residual-based CNNs and Transformer-based networks? Why can CNNs utilize low-rank adaptation (LoRA) and pruning techniques? The root cause of these open questions is the lack of a robust theoretical foundation for deep learning models in CV. To address them, we employ the Universal Approximation Theorem (UAT) to provide a theoretical basis for convolution- and Transformer-based models in CV, and we use that basis to answer these questions from a theoretical perspective.

Paper Structure (19 sections, 25 equations, 12 figures)


Figures (12)

  • Figure 1: The basic form of the UAT and the Matrix-Vector Method, and the relationship between them.
  • Figure 2: 1-O Conv2D process and the matrix-vector transformation example. Panel a shows the 1-O Conv2D operation. The mathematical formula corresponding to each diagram is given at the top of panels a, b, and c. Variables in the figure are grouped, e.g., $\mathbf{W} = \mathrm{Concat}(\mathbf{W}_1, \cdots, \mathbf{W}_O)$. Panel b gives an example corresponding to a, and panel c shows the matrix-vector form of the example in b. The following figures follow the same conventions.
  • Figure 3: I-O Conv2D process and the matrix-vector transformation example.
  • Figure 4: 1-O Conv3D process and the matrix-vector transformation example.
  • Figure 5: I-1 Conv3D process and the matrix-vector transformation example.
  • ...and 7 more figures
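
The matrix-vector recasting of convolution that the figure captions refer to can be illustrated with a minimal sketch. It assumes the simplest case only: one input channel, one output channel, stride 1, and no padding. The function names (`conv2d_direct`, `conv2d_as_matrix`, `matvec`, `flatten`) and the row-major flattening order are illustrative choices, not the paper's notation, and the paper's own matrix layout may differ.

```python
def conv2d_direct(x, k):
    """Sliding-window (valid, stride-1) cross-correlation of image x with kernel k."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(k[a][b] * x[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def conv2d_as_matrix(k, H, W):
    """Build the (out_h*out_w) x (H*W) matrix that acts on the flattened image.

    Each row corresponds to one output position; its nonzero entries are the
    kernel weights placed at the input positions that the window covers.
    """
    kh, kw = len(k), len(k[0])
    out_h, out_w = H - kh + 1, W - kw + 1
    M = [[0.0] * (H * W) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            r = i * out_w + j  # row-major index of the output position
            for a in range(kh):
                for b in range(kw):
                    M[r][(i + a) * W + (j + b)] = k[a][b]
    return M


def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def flatten(x):
    """Row-major flattening of a 2D list."""
    return [v for row in x for v in row]


# Hypothetical 3x3 input and 2x2 kernel, chosen only for illustration.
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
k = [[1.0, 0.0],
     [0.0, -1.0]]

direct = flatten(conv2d_direct(x, k))
conv_matrix = conv2d_as_matrix(k, 3, 3)
via_matrix = matvec(conv_matrix, flatten(x))

# Both routes compute the same outputs.
assert direct == via_matrix
```

Because the convolution becomes a single matrix acting on a flattened input, it takes the same algebraic shape as a linear map, which is what allows different CV operations to be compared within one matrix-vector form.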