Implicit Bias in Deep Linear Discriminant Analysis

Jiawen Li

TL;DR

By analyzing the gradient flow of the loss on an L-layer diagonal linear network, the paper proves that, under balanced initialization, the network architecture turns standard additive gradient updates into multiplicative weight updates, and that these dynamics automatically conserve the (2/L) quasi-norm.

Abstract

While the implicit bias (or implicit regularization) of standard loss functions has been studied, the optimization geometry induced by discriminative metric-learning objectives remains largely unexplored. To the best of our knowledge, this paper presents an initial theoretical analysis of the implicit regularization induced by Deep LDA, a scale-invariant objective designed to minimize intraclass variance and maximize interclass distance. By analyzing the gradient flow of the loss on an L-layer diagonal linear network, we prove that, under balanced initialization, the network architecture transforms standard additive gradient updates into multiplicative weight updates, whose dynamics automatically conserve the (2/L) quasi-norm.
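
The conservation claim can be checked numerically with a minimal sketch. Everything here is an illustrative assumption rather than the paper's setup: the loss is a generic Fisher ratio (within-class scatter over between-class scatter) used as a stand-in for the Deep LDA objective, the effective weight is taken to be the elementwise product of the L layers, and the names `fisher_ratio`, `quasi_norm_power`, and `run` are hypothetical. For any scale-invariant loss f, Euler's identity gives ⟨β, ∇f(β)⟩ = 0, which is what keeps the sum of |β_i|^(2/L) fixed along the continuous flow; small discrete steps only approximate this, so a slight drift is expected.

```python
import numpy as np

def fisher_ratio(beta, S_w, delta):
    """Scale-invariant stand-in loss: (beta' S_w beta) / (beta' delta)^2."""
    return (beta @ S_w @ beta) / (beta @ delta) ** 2

def fisher_ratio_grad(beta, S_w, delta):
    """Gradient of fisher_ratio with respect to the effective weights beta."""
    proj = beta @ delta
    return 2.0 * (S_w @ beta) / proj**2 - 2.0 * (beta @ S_w @ beta) * delta / proj**3

def quasi_norm_power(beta, L):
    """sum_i |beta_i|^(2/L), i.e. the (2/L) quasi-norm raised to the power 2/L."""
    return np.sum(np.abs(beta) ** (2.0 / L))

def run(L=3, steps=5000, lr=1e-3):
    # Assumed toy problem: diagonal within-class scatter and a mean-difference vector.
    S_w = np.diag([1.0, 2.0, 0.5, 1.5])
    delta = np.array([1.0, 0.5, 0.2, 0.1])

    # Balanced initialization: every layer starts from the same positive vector.
    W = np.ones((L, 4))
    beta = np.prod(W, axis=0)  # effective weights of the diagonal linear network
    q0 = quasi_norm_power(beta, L)
    loss0 = fisher_ratio(beta, S_w, delta)

    for _ in range(steps):
        beta = np.prod(W, axis=0)
        g = fisher_ratio_grad(beta, S_w, delta)
        # Chain rule: dLoss/dW[l] = g * (elementwise product of all other layers).
        grads = np.stack(
            [g * np.prod(np.delete(W, l, axis=0), axis=0) for l in range(L)]
        )
        # Small explicit Euler step as a discrete approximation of gradient flow.
        W = W - lr * grads

    beta = np.prod(W, axis=0)
    return q0, quasi_norm_power(beta, L), loss0, fisher_ratio(beta, S_w, delta), beta

if __name__ == "__main__":
    q0, q1, loss0, loss1, beta = run()
    print("quasi-norm power sum before/after:", q0, q1)
    print("loss before/after:", loss0, loss1)
```

Under these assumptions the loss should decrease while the power sum stays close to its initial value; the residual drift comes from the finite step size and shrinks as `lr` is reduced.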

Paper Structure

This paper contains 11 sections, 29 equations, 2 figures.

Figures (2)

  • Figure 1: The Optimization in Simplex and the Implicit Bias
  • Figure 2: The Simulated Result for DeepLDA in DLNs.