Neural Feature Learning in Function Space

Xiangxiang Xu; Lizhong Zheng

TL;DR

This work introduces a framework for neural feature learning built on a function-space feature geometry, which links statistical dependence to learned features through the canonical dependence kernel and the H-score. It then develops a nesting technique that decomposes dependence components in multivariate settings and learns their modal decompositions, so that the learned features can be assembled into different inference models without retraining. The approach is demonstrated on conditional inference, learning with side information, and multimodal learning with missing modalities, with theoretical results connecting it to maximum entropy, maximum likelihood estimation in local regimes, classical regression, and multitask networks. Empirically, the authors verify the learned maximal-variance dependence modes on discrete, continuous, and sequential data, and show that posterior relations and conditional expectations can be reconstructed from the learned features. Overall, the framework offers a scalable, interpretable, and modular way to use neural feature extractors for representing multivariate dependence.
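To make the canonical dependence kernel concrete, the following is a minimal sketch for a small discrete joint distribution. It assumes the usual definition $i_{X;Y}(x, y) = P_{X,Y}(x, y) / (P_X(x) P_Y(y)) - 1$ and assumes the squared norm is taken under the product of the marginals; the function name and the example distributions are illustrative and not taken from the paper.

```python
def canonical_dependence_kernel(joint):
    """Return (kernel, squared_norm) for a joint pmf given as joint[x][y].

    Assumed definition: kernel[x][y] = P(x, y) / (P(x) * P(y)) - 1,
    with the squared norm weighted by the product of marginals P(x) * P(y).
    """
    p_x = [sum(row) for row in joint]        # marginal of X
    p_y = [sum(col) for col in zip(*joint)]  # marginal of Y
    kernel = [[joint[x][y] / (p_x[x] * p_y[y]) - 1.0
               for y in range(len(p_y))]
              for x in range(len(p_x))]
    sq_norm = sum(p_x[x] * p_y[y] * kernel[x][y] ** 2
                  for x in range(len(p_x))
                  for y in range(len(p_y)))
    return kernel, sq_norm


# Hypothetical toy distributions over binary X and Y.
dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

kernel_dep, norm_dep = canonical_dependence_kernel(dependent)
kernel_ind, norm_ind = canonical_dependence_kernel(independent)
print("dependent:", norm_dep, "independent:", norm_ind)
```

Under these assumptions the kernel vanishes exactly when X and Y are independent, so its norm acts as a measure of how much dependence there is for features to capture.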

Abstract

We present a novel framework for learning system design with neural feature extractors. First, we introduce the feature geometry, which unifies statistical dependence and feature representations in a function space equipped with inner products. This connection defines function-space concepts on statistical dependence, such as norms, orthogonal projection, and spectral decomposition, exhibiting clear operational meanings. In particular, we associate each learning setting with a dependence component and formulate learning tasks as finding corresponding feature approximations. We propose a nesting technique, which provides systematic algorithm designs for learning the optimal features from data samples with off-the-shelf network architectures and optimizers. We further demonstrate multivariate learning applications, including conditional inference and multimodal learning, where we present the optimal features and reveal their connections to classical approaches.
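As an illustration of learning features from data samples, the sketch below estimates an H-score from paired feature vectors and then sums it over nested feature prefixes. The formula used, $\mathbb{E}[f^\mathrm{T} g] - \mathbb{E}[f]^\mathrm{T}\mathbb{E}[g] - \tfrac{1}{2}\operatorname{tr}(\Lambda_f \Lambda_g)$ with $\Lambda_f = \mathbb{E}[f f^\mathrm{T}]$, is the commonly stated form of the H-score and is an assumption here, as is the prefix-sum form of the nested score; neither is quoted from the text above. All function names and the toy data are hypothetical.

```python
def mean_vector(rows):
    """Sample mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def second_moment(rows):
    """Uncentered second-moment matrix E[v v^T] estimated from samples."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)]
            for i in range(k)]


def h_score(f, g):
    """Empirical H-score of paired features f[i] = f(x_i), g[i] = g(y_i).

    Assumed form: E[f.g] - E[f].E[g] - 0.5 * tr(Lambda_f Lambda_g).
    """
    n, k = len(f), len(f[0])
    cross = sum(sum(a * b for a, b in zip(fi, gi))
                for fi, gi in zip(f, g)) / n
    mean_term = sum(a * b for a, b in zip(mean_vector(f), mean_vector(g)))
    lam_f, lam_g = second_moment(f), second_moment(g)
    trace = sum(lam_f[i][j] * lam_g[j][i]
                for i in range(k) for j in range(k))
    return cross - mean_term - 0.5 * trace


def nested_h_score(f, g):
    """Sum of H-scores over the first 1, 2, ..., k feature dimensions
    (an assumed form of the nested objective)."""
    k = len(f[0])
    return sum(h_score([r[:i] for r in f], [r[:i] for r in g])
               for i in range(1, k + 1))


# Toy paired samples of binary (x, y) with positive dependence.
pairs = [(1, 1)] * 4 + [(1, -1)] + [(-1, 1)] + [(-1, -1)] * 4
f1 = [[float(x)] for x, _ in pairs]
g1 = [[0.6 * y] for _, y in pairs]
f2 = [[float(x), 0.0] for x, _ in pairs]   # second dimension unused
g2 = [[0.6 * y, 0.0] for _, y in pairs]

print(h_score(f1, g1), h_score(f2, g2), nested_h_score(f2, g2))
```

In a training loop the negated score would serve as the loss for the two feature extractors. The nested sum counts the leading dimensions in more terms than the trailing ones, which is one way to impose an ordering on the learned feature dimensions.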

Paper Structure (102 sections, 28 theorems, 257 equations, 24 figures, 1 table)

This paper contains 102 sections, 28 theorems, 257 equations, 24 figures, 1 table.

Introduction
Notations and Preliminaries
Feature Geometry
Vector Space
Feature Space
Neural Feature Extractors
Joint Functions
Feature Geometry on Data Samples
Modal Decomposition
Definitions and Properties
Constrained Modal Decomposition
Statistical Dependence and Induced Features
Weak Dependence and Local Geometric Analyses
Dependence Approximation and Feature Learning
Low Rank Approximation of Statistical Dependence
...and 87 more sections

Key Result

Proposition 5

Suppose $\mathcal{G}_{\mathcal{X}}$ and $\mathcal{G}_{\mathcal{Y}}$ are subspaces of $\mathcal{F}_{\mathcal{X}}$ and $\mathcal{F}_{\mathcal{Y}}$, respectively. Then, for all $\gamma \in \mathcal{F}_{\mathcal{X} \times \mathcal{Y}}$ and $k \geq 1$, we have $\zeta_{k}(\gamma \mid$ [...] where we have defined $\mathcal{G}_{\mathcal{X}}^{\,k} \triangleq (\mathcal{G}_{\mathcal{X}}$ [...]

Figures (24)

Figure 1: Schematic diagram of a general feature-centric learning system
Figure 2: Schematic representations of neural feature extractors. (a): a general feature extractor $f \in \mathcal{F}_{\mathcal{Z}}^{\,k}$; (b): a linear layer with weight matrix $W$ and bias $\underline{b}$; (c): the composition of feature extractor blocks, where each dimension of the output lies in the feature subspace $\operatorname{span}\{\phi\}$.
Figure 3: Features $f$, $g$ as the output of linear layers. The linear layers are represented as triangle modules, with inputs $\phi, \psi$, and weight matrices $W_{\sf x}$, $W_{\sf y}$, respectively.
Figure 4: A classification DNN for predicting label $Y$ based on the input $X$. All layers before the classification layer are represented as feature extractor $f$. The weight and bias associated with each class $Y = y$ are denoted by $g(y)$ and $b(y)$, respectively, which form the weight matrix $G^\mathrm{T}$ and bias vector $\underline{b}$ with $G = [g(1), \dots, g(|\mathcal{Y}|)]$, $\underline{b} = [b(1), \dots, b(|\mathcal{Y}|)]^\mathrm{T}$. The softmax module outputs a posterior probability, parameterized by $f$, $g$ and $b$.
Figure 5: Nesting technique for modal decomposition: the nested H-score is computed with a nested architecture, where "$\mathbin{+\mkern-10mu+}$" denotes the concatenation operation of two features.
...and 19 more figures
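
The classification network described in the Figure 4 caption can be sketched as follows: the posterior for class $y$ is the softmax of $f(x)^\mathrm{T} g(y) + b(y)$ over all classes. This is a minimal sketch under that reading of the caption; the function name and the numeric values are hypothetical.

```python
import math


def softmax_posterior(f_x, G, b):
    """Posterior over classes given a feature vector f_x = f(x).

    G is a list of per-class weight vectors g(y) and b a list of
    per-class biases b(y); the logit for class y is f_x . g(y) + b(y).
    """
    logits = [sum(fi * gi for fi, gi in zip(f_x, g_y)) + b_y
              for g_y, b_y in zip(G, b)]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional feature and three classes.
f_x = [1.0, -0.5]
G = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.1, -0.2]

posterior = softmax_posterior(f_x, G, b)
print(posterior)
```

Writing the classifier this way makes the roles of the two feature maps explicit: $f$ depends only on the input, while $g$ and $b$ depend only on the label.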

Theorems & Definitions (44)

Definition 1
Definition 2
Remark 3
Remark 4
Proposition 5
Corollary 6: HGR Maximal Correlation Functions
Proposition 7
Example 1: Canonical Correlation Analysis
Definition 8: $\epsilon$-Dependence
Lemma 9: HuangMWZ2024
...and 34 more
