Learning Neural Networks with Sparse Activations

Pranjal Awasthi; Nishanth Dikkala; Pritish Kamath; Raghu Meka

TL;DR

This paper initiates a formal study of PAC learnability of MLP layers that exhibit activation sparsity, and presents a variety of results showing that such function classes yield provable computational and statistical advantages over their non-sparse counterparts.

Abstract

A core component of many successful neural network architectures is an MLP block: two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where some neurons or weights can be deleted from the network, this form of dynamic activation sparsity appears harder to exploit to obtain more efficient networks. Motivated by this, we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of sparsely activated networks will lead to methods that can exploit activation sparsity in practice.
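To make the notion of dynamic activation sparsity concrete, here is a minimal NumPy sketch of an MLP block that counts how many hidden units fire on each input. It is an illustration only, not the paper's construction: the ReLU activation, the random Gaussian weights, the constant negative bias used to induce sparsity, and the names `mlp_block` and `activation_sparsity` are all assumptions made for this example.

```python
import numpy as np

def mlp_block(x, W, b, a):
    """Two fully connected layers with a ReLU in between.

    Returns the scalar output and the hidden activation vector.
    """
    hidden = np.maximum(W @ x + b, 0.0)
    return a @ hidden, hidden

def activation_sparsity(X, W, b):
    """Number of non-zero hidden units for each input (row) of X."""
    hidden = np.maximum(X @ W.T + b, 0.0)
    return np.count_nonzero(hidden, axis=1)

rng = np.random.default_rng(0)
n, s = 64, 256                                # input dimension, hidden width (illustrative)
W = rng.standard_normal((s, n)) / np.sqrt(n)  # assumption: random weights, not trained ones
b = np.full(s, -2.0)                          # assumption: negative bias so few units fire
a = rng.standard_normal(s)

# Inputs drawn uniformly from the hypercube {-1, +1}^n.
X = rng.choice([-1.0, 1.0], size=(1000, n))

active = activation_sparsity(X, W, b)
out, hidden = mlp_block(X[0], W, b, a)
print("mean active units:", active.mean(), "max active units:", active.max())
```

Which units are active changes from input to input, so no single neuron can be pruned outright; only the per-input count of active units is small. That is the sense in which this sparsity is dynamic rather than structural.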

Paper Structure (18 sections, 13 theorems, 43 equations)

Introduction
Learning under uniform distribution.
Learning under general distributions.
Related Work
Preliminaries
Fourier Analysis and the Low-Degree Algorithm
Low-degree algorithm.
Learning over Uniform Distribution
Lower Bounds for Learning $\mathcal{H}_{n,s,1}$
Sparse Activations Can Simulate Juntas
Hardness Under Arbitrary Distributions
SQ Hardness
Cryptographic Hardness
Learning under General Distributions
Generalization to $k$-sparsely activated networks.
...and 3 more sections

Key Result

theorem 1

Any SQ algorithm for learning $\mathcal{H}_{n,O(\sqrt{n}),1}^{O(n^{0.75}), O(n)}$ under arbitrary distributions over the hypercube requires either $2^{-\Omega(\sqrt{n})}$ tolerance or $2^{\Omega(\sqrt{n})}$ queries. Assuming the hardness of the learning with rounding problem with polynomial modulus, the

Theorems & Definitions (22)

definition 1: Sparsely Activated Networks
definition 2
theorem 1: Informal; see \ref{sec:lb-uniform}
theorem 2: Informal; see \ref{thm:generalk-uniform-ub}
theorem 3: Informal; see \ref{thm:general-dist-upper-bound}
Claim 2.1
lemma 1: kane14average
proof
lemma 2
theorem 4
...and 12 more
