Approximated Orthogonal Projection Unit: Stabilizing Regression Network Training Using Natural Gradient

Shaoqi Wang, Chunjie Yang, Siwei Lou

TL;DR

AOPU is proved to attain minimum variance estimation (MVE) in NN. Its truncated gradient, obtained by cutting off gradient backpropagation at the dual parameters, approximates the natural gradient (NG).

Abstract

Neural networks (NN) are extensively studied in cutting-edge soft sensor models due to their feature extraction and function approximation capabilities. Current research into network-based methods primarily focuses on models' offline accuracy. In the industrial soft sensor context, however, the stability of online optimization and interpretability are prioritized, followed by accuracy. This requires a clearer understanding of the network's training process. To bridge this gap, we propose a novel NN named the Approximated Orthogonal Projection Unit (AOPU), which has a solid mathematical basis and presents superior training stability. AOPU truncates the gradient backpropagation at the dual parameters, optimizes the updates of the trackable parameters, and enhances the robustness of training. We further prove that AOPU attains minimum variance estimation (MVE) in NN, wherein the truncated gradient approximates the natural gradient (NG). Empirical results on two chemical process datasets clearly show that AOPU outperforms other models in achieving stable convergence, marking a significant advancement in the soft sensor field.
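
To make the role of the natural gradient more concrete, the following is a minimal sketch that contrasts plain gradient descent (GD) with natural gradient descent (NGD) on an ill-conditioned linear regression problem. It is not the AOPU algorithm: the synthetic data, the function names (`make_problem`, `gd_step`, `ngd_step`, `run`), and the assumption of unit-variance Gaussian noise (under which the Fisher information matrix is $X^\top X / n$) are all illustrative choices, not taken from the paper.

```python
import numpy as np

def make_problem(n=200, seed=0):
    """Synthetic regression data (hypothetical, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Two features with very different scales make X^T X ill-conditioned.
    X = rng.normal(size=(n, 2)) * np.array([1.0, 10.0])
    w_true = np.array([2.0, -0.5])
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def grad(w, X, y):
    # Gradient of the loss 0.5 * mean((X w - y)^2) with respect to w.
    return X.T @ (X @ w - y) / len(y)

def gd_step(w, X, y, lr):
    # Plain gradient descent: step along the raw gradient.
    return w - lr * grad(w, X, y)

def ngd_step(w, X, y, lr):
    # Natural gradient: precondition the gradient with the inverse Fisher
    # matrix. ASSUMPTION: unit-variance Gaussian noise, so F = X^T X / n.
    fisher = X.T @ X / len(y)
    return w - lr * np.linalg.solve(fisher, grad(w, X, y))

def run(step, X, y, lr, iters):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = step(w, X, y, lr)
    return w

X, y = make_problem()
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares reference
w_gd = run(gd_step, X, y, lr=0.01, iters=20)    # 20 small GD steps
w_ngd = run(ngd_step, X, y, lr=1.0, iters=1)    # a single NGD step
err_gd = np.linalg.norm(w_gd - w_ls)
err_ngd = np.linalg.norm(w_ngd - w_ls)
```

Comparing `err_gd` and `err_ngd` shows the effect of the update direction: in this linear Gaussian setting a single unit-length NGD step from zero lands on the least-squares solution, while GD with a step size small enough to remain stable along the large-scale feature progresses slowly along the small-scale one.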

Paper Structure (31 sections, 6 theorems, 26 equations, 16 figures, 5 tables, 2 algorithms)


Key Result

Proposition 1

There does not exist a transition operator $T$ independent of $x$ such that, for a given parameter matrix $W$ and $\forall x_1, x_2$, the following equations hold. The proof is given in the appendix.

Figures (16)

  • Figure 1: Trackable parameters and Untrackable parameters. Solid green lines represent model parameters, and orange curves represent non-parametric operations. (a) Conventional deep NN framework. (b) Typical broad learning system framework through data enhancement.
  • Figure 2: Comparison between NGD and GD. Direction matters more than step size (learning rate) in stable convergence.
  • Figure 3: AOPU's data flow schematic. The gradient is backpropagated but truncated at the dual parameter, and this gradient is then used to update the trackable parameter.
  • Figure 4: Histogram of the frequency distribution of RR on SRU dataset under varying batch sizes and sequence length settings.
  • Figure 5: Curve of the mean of RR distribution on SRU dataset under varying batch sizes and sequence length settings.
  • ...and 11 more figures

Theorems & Definitions (13)

  • Proposition 1
  • Definition 1
  • Proposition 2
  • Proof
  • Proposition 3
  • Proof
  • Definition 2
  • Proposition 4
  • Proof
  • Proposition 5
  • ...and 3 more