Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks

Hengyu Fu; Zihao Wang; Eshaan Nichani; Jason D. Lee

TL;DR

This work studies the learning of hierarchical polynomials of multiple nonlinear features using three-layer neural networks, and shows that a three-layer neural network trained via layerwise gradient descent suffices for efficient feature learning.

Abstract

A central question in deep learning theory is how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of multiple nonlinear features using three-layer neural networks. We examine a broad class of functions of the form $f^{\star}=g^{\star}\circ \mathbf{p}$, where $\mathbf{p}:\mathbb{R}^{d} \rightarrow \mathbb{R}^{r}$ represents multiple quadratic features with $r \ll d$ and $g^{\star}:\mathbb{R}^{r}\rightarrow \mathbb{R}$ is a polynomial of degree $p$. This can be viewed as a nonlinear generalization of the multi-index model \citep{damian2022neural}, and as an extension of previous work that focused on a single nonlinear feature, i.e. $r = 1$ \citep{nichani2023provable,wang2023learning}. Our primary contribution is to show that a three-layer neural network trained via layerwise gradient descent suffices for (i) complete recovery of the space spanned by the nonlinear features and (ii) efficient learning of the target function $f^{\star}=g^{\star}\circ \mathbf{p}$, or transfer learning of $f=g\circ \mathbf{p}$ with a different link function, within $\widetilde{\mathcal{O}}(d^4)$ samples and polynomial time. For such hierarchical targets, our result substantially improves on the $\Theta(d^{2p})$ sample complexity of kernel methods, demonstrating the power of efficient feature learning. Our results leverage novel techniques and thus go beyond all prior settings, such as single-index and multi-index models as well as models depending on just one nonlinear feature, contributing to a more comprehensive understanding of feature learning in deep learning.
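
To make the function class concrete, the following is a minimal NumPy sketch of one way to sample a hierarchical target of this form. It is an illustration only and not the paper's construction: the standard Gaussian inputs, the symmetric trace-free matrices `A_k` with unit Frobenius norm, the helper names `make_quadratic_features`, `features`, and `link`, and the particular degree-3 link polynomial are all assumptions chosen for the example.

```python
import numpy as np

def make_quadratic_features(d, r, rng):
    """Sample r symmetric, trace-free d x d matrices (hypothetical choice of features)."""
    mats = []
    for _ in range(r):
        G = rng.standard_normal((d, d))
        A = (G + G.T) / 2.0                   # symmetrize
        A -= (np.trace(A) / d) * np.eye(d)    # trace-free: E[x^T A x] = 0 for x ~ N(0, I)
        A /= np.linalg.norm(A)                # unit Frobenius norm
        mats.append(A)
    return np.stack(mats)                     # shape (r, d, d)

def features(X, A):
    """Quadratic features p_k(x) = x^T A_k x for each row x of X; returns shape (n, r)."""
    return np.einsum("ni,kij,nj->nk", X, A, X)

def link(Z):
    """A hypothetical degree-3 polynomial g: R^r -> R, applied row-wise."""
    return Z.sum(axis=1) ** 3 + Z[:, 0] * Z[:, -1]

# Assumed sizes for illustration, with r much smaller than d.
n, d, r = 1000, 20, 3
rng = np.random.default_rng(0)

A = make_quadratic_features(d, r, rng)
X = rng.standard_normal((n, d))               # assumed input distribution: N(0, I_d)
y = link(features(X, A))                      # f(x) = g(p(x))
```

The target depends on `x` only through the `r` quadratic features, so a learner that recovers the span of the `A_k` can reduce the problem from dimension `d` to dimension `r` before fitting the link function.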
