Unraveling the Hessian: A Key to Smooth Convergence in Loss Function Landscapes

Nikita Kiselev; Andrey Grabovoy

TL;DR

This paper investigates how the loss surface changes when the sample size increases, a previously unexplored issue, and theoretically analyzes the convergence of the loss landscape in a fully connected neural network to derive upper bounds for the difference in loss function values when adding a new object to the sample.

Abstract

The loss landscape of neural networks is a critical aspect of their training, and understanding its properties is essential for improving their performance. In this paper, we investigate how the loss surface changes when the sample size increases, a previously unexplored issue. We theoretically analyze the convergence of the loss landscape in a fully connected neural network and derive upper bounds for the difference in loss function values when adding a new object to the sample. Our empirical study confirms these results on various datasets, demonstrating the convergence of the loss function surface for image classification tasks. Our findings provide insights into the local geometry of neural loss landscapes and have implications for the development of sample size determination techniques.
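The quantity studied here, the change in mean loss when one more object joins the sample, can be illustrated numerically. The sketch below is not the paper's experimental setup: it assumes a linear softmax classifier with fixed random weights and synthetic data, and every name in it (`per_sample_losses`, `loss_differences`, `W`, `X`, `y`) is hypothetical. It only shows how $|\mathcal{L}_{k+1}(\boldsymbol{\theta}) - \mathcal{L}_{k}(\boldsymbol{\theta})|$ can be computed for a fixed $\boldsymbol{\theta}$ as $k$ grows.

```python
import numpy as np

def per_sample_losses(W, X, y):
    """Cross-entropy of a linear softmax classifier, one value per sample."""
    logits = X @ W                                  # shape (n, K)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]

def loss_differences(losses):
    """|L_{k+1} - L_k| for k = 1..n-1, with L_k the mean of the first k losses."""
    k = np.arange(1, len(losses) + 1)
    running_mean = np.cumsum(losses) / k
    return np.abs(np.diff(running_mean))

# Synthetic stand-in data (an assumption, not one of the paper's datasets).
rng = np.random.default_rng(0)
n, d, K = 2000, 16, 10
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)
W = rng.normal(scale=0.1, size=(d, K))              # fixed parameters theta

losses = per_sample_losses(W, X, y)
diffs = loss_differences(losses)

print("mean |L_{k+1} - L_k| over the first 100 sizes:", diffs[:100].mean())
print("mean |L_{k+1} - L_k| over the last 100 sizes: ", diffs[-100:].mean())
```

Because each per-object loss is bounded in this toy setting, the difference shrinks roughly like $1/(k+1)$. The paper's analysis concerns the neighborhood of the minimum $\boldsymbol{\theta}^*$ of a fully connected network, which this fixed-weight sketch does not reproduce.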

Paper Structure (20 sections, 3 theorems, 51 equations, 5 figures, 1 table)

This paper contains 20 sections, 3 theorems, 51 equations, 5 figures, 1 table.

Introduction
Related Work
Preliminaries
General notation
Second-order approximation
Fully connected neural network
Hessian decomposition
Convergence of the loss difference
Boundedness of the Hessian
Losses difference convergence
Experiments
Direct Image Classification
Image features extraction
Discussion
Conclusion
...and 5 more sections

Key Result

Theorem 1

Consider an $L$-layer fully connected neural network with ReLU activation function and without bias terms, applied to solve a $K$-label classification problem. Suppose the following is satisfied: $\| \mathbf{W}^{(p)} \|_2 \leqslant M_{\mathbf{W}}$ and $\| \mathbf{x}_i \|_2 \leqslant M_{\mathbf{x}}$ f

Figures (5)

Figure 1: Overview of our observations. Part (a) shows the loss function landscape, a surface over the parameter space. Part (b) shows the loss difference that arises when one more object is added to the dataset. The behavior is shown for dimension 2. Near the minimum $\boldsymbol{\theta}^*$, the mean loss over $k+1$ objects, $\mathcal{L}_{k+1}(\boldsymbol{\theta})$, is close to the mean loss over $k$ objects, $\mathcal{L}_{k}(\boldsymbol{\theta})$.
Figure 2: The dependence of the absolute value of the loss function difference on the available sample size, direct image classification. The graphs on the left show a decrease in values as the dimension of the hidden layer increases. The graphs on the right show an increase in values as the number of layers increases.
Figure 3: The dependence of the absolute value of the loss function difference on the available sample size, image features extraction. The graphs on the left show a decrease in values as the dimension of the hidden layer increases. The graphs on the right show an increase in values as the number of layers increases.
Figure 4: The dependence of the absolute value of the loss function difference on the available sample size, direct image classification. The graphs on the left show a decrease in values as the dimension of the hidden layer increases. The graphs on the right show an increase in values as the number of layers increases. Results on different datasets: FashionMNIST, CIFAR10 and CIFAR100.
Figure 5: The dependence of the absolute value of the loss function difference on the available sample size, image features extraction. The graphs on the left show a decrease in values as the dimension of the hidden layer increases. The graphs on the right show an increase in values as the number of layers increases. Results on different datasets: FashionMNIST, CIFAR10 and CIFAR100.
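The loss difference plotted in these figures has a simple closed form under one assumption: that $\mathcal{L}_{k}(\boldsymbol{\theta})$ is the arithmetic mean of per-object losses, as the caption of Figure 1 suggests. Writing $\ell_i(\boldsymbol{\theta})$ for the loss on the $i$-th object (a symbol introduced here for illustration, not taken from the paper), the difference works out as follows:

```latex
\begin{align*}
\mathcal{L}_{k+1}(\boldsymbol{\theta}) - \mathcal{L}_{k}(\boldsymbol{\theta})
  &= \frac{1}{k+1}\sum_{i=1}^{k+1} \ell_i(\boldsymbol{\theta})
     - \frac{1}{k}\sum_{i=1}^{k} \ell_i(\boldsymbol{\theta}) \\
  &= \frac{k\,\mathcal{L}_{k}(\boldsymbol{\theta}) + \ell_{k+1}(\boldsymbol{\theta})}{k+1}
     - \mathcal{L}_{k}(\boldsymbol{\theta}) \\
  &= \frac{\ell_{k+1}(\boldsymbol{\theta}) - \mathcal{L}_{k}(\boldsymbol{\theta})}{k+1}.
\end{align*}
```

So whenever the gap between the new object's loss and the current mean stays bounded, the absolute difference decays at least as fast as $1/(k+1)$. This is consistent with the decreasing trend over sample size that the captions describe, though it is not a substitute for the paper's bounds.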

Theorems & Definitions (6)

Theorem 1
Lemma 1
Lemma 2
proof
proof
proof
