An Over-parameterized Exponential Regression

Yeqi Gao; Sridhar Mahadevan; Zhao Song

TL;DR

The neural function $F$ is defined using an exponential activation function, and several tight analysis techniques from previous studies are employed to optimize the over-parameterization bound $m$.

Abstract

Over the past few years, a significant body of research has studied the ReLU activation function, with the aim of achieving neural network convergence through over-parameterization. However, recent developments in Large Language Models (LLMs) have sparked interest in exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of labeled data points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t),x)$ can be expressed as $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r(t), x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. As is standard in the literature, the $a_r$ are fixed weights that are never changed during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussian distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize $a_r$ from a random sign distribution for each $r \in [m]$. Using the gradient descent algorithm, we can find a weight $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang arXiv 2019, Munteanu, Omlor, Song and Woodruff ICML 2022].
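The setup in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the squared loss $\frac{1}{2}\|F(W,X)-y\|_2^2$, the unit-norm inputs, the step size `eta`, the iteration count `T`, and the helper names `make_data`, `predict`, `loss`, `grad`, and `train` are all assumptions introduced here. Only the form of $F$, the Gaussian initialization of $w_r(0)$, and the fixed random signs $a_r$ come from the abstract.

```python
import numpy as np


def make_data(n, d, seed=0):
    """Hypothetical toy data: unit-norm inputs (an assumption) and Gaussian labels."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # keeps exp(<w, x>) moderate
    y = rng.standard_normal(n)
    return X, y


def predict(W, a, X):
    """F(W, x_i) = sum_r a_r * exp(<w_r, x_i>) for every row x_i of X."""
    # W has shape (d, m), a has shape (m,), X has shape (n, d).
    return np.exp(X @ W) @ a


def loss(W, a, X, y):
    """Assumed objective: 0.5 * ||F(W, X) - y||_2^2."""
    resid = predict(W, a, X) - y
    return 0.5 * float(resid @ resid)


def grad(W, a, X, y):
    """Gradient of the assumed loss with respect to W; a is held fixed."""
    act = np.exp(X @ W)          # (n, m): exp(<w_r, x_i>)
    resid = act @ a - y          # (n,):   F(W, x_i) - y_i
    # dL/dw_r = sum_i resid_i * a_r * exp(<w_r, x_i>) * x_i
    return X.T @ (resid[:, None] * act * a[None, :])   # (d, m)


def train(X, y, m=64, eta=1e-5, T=500, seed=0):
    """Plain gradient descent on W; eta and T are illustrative choices."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, m))          # w_r(0) ~ N(0, I_d)
    a = rng.choice([-1.0, 1.0], size=m)      # random signs, never updated
    losses = [loss(W, a, X, y)]
    for _ in range(T):
        W = W - eta * grad(W, a, X, y)
        losses.append(loss(W, a, X, y))
    return W, a, losses


if __name__ == "__main__":
    X, y = make_data(n=5, d=3, seed=1)
    W, a, losses = train(X, y)
    print("initial loss:", losses[0])
    print("final loss:  ", losses[-1])
```

The step size is deliberately tiny because the exponential activation makes both the predictions and the gradients large at Gaussian initialization. The sketch makes no attempt to reproduce the width $m = \Omega(n^{2+o(1)}\log(n/\delta))$ or the accuracy guarantee from the paper.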

Paper Structure (45 sections, 15 theorems, 94 equations)

This paper contains 45 sections, 15 theorems, 94 equations.

Introduction
Our Results
Related Work
Training over-parameterized neural network
Convergence
Over-parametrization bound, bound on $m$
Using data structure to speedup cost per iteration
Attention Theory
Fast computation and optimization
Expressivity for transformer
In-context learning
Other applications and theories of transformer
Roadmap.
Technique Overview
Bounding the loss by induction
...and 30 more sections

Key Result

Theorem 1.1

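Let $\delta \in (0,0.1)$ denote the failure probability and let $\epsilon \in (0,0.1)$ denote the accuracy. If the conditions of the theorem hold (the abstract states the width requirement $m = \Omega(n^{2+o(1)}\log(n/\delta))$), then after running the gradient descent algorithm for $T$ iterations we obtain, with probability at least $1-\delta$, a weight $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$.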

Theorems & Definitions (50)

Theorem 1.1: Main result, formal version of Theorem \ref{thm:formal}
Definition 4.1
Definition 4.2
Definition 4.3
Lemma 4.8: Bernstein inequality [b24]
Lemma 4.9: Hoeffding inequality [h63]
Lemma 4.10: Laurent and Massart [lm00]
Definition 5.1: $F(t)$, dynamic prediction
Definition 5.2: Loss function over time
Definition 5.3: $\Delta w_r(t)$
...and 40 more
