Convergence Analysis of Two-Layer Neural Networks under Gaussian Input Masking

Afroditi Kolomvaki, Fangshuo Liao, Evan Dramko, Ziyun Guang, Anastasios Kyrillidis

TL;DR

A Neural Tangent Kernel (NTK) analysis shows that training a two-layer ReLU network with Gaussian randomly masked inputs converges linearly up to an error region proportional to the mask's variance.

Abstract

We investigate the convergence guarantee of two-layer neural network training with Gaussian randomly masked inputs. This scenario corresponds to Gaussian dropout at the input level, or noisy input training common in sensor networks, privacy-preserving training, and federated learning, where each user may have access to partial or corrupted features. Using a Neural Tangent Kernel (NTK) analysis, we demonstrate that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region proportional to the mask's variance. A key technical contribution is resolving the randomness within the non-linear activation, a problem of independent interest.
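
The training setup described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the mask is assumed to be drawn as $\mathbf{c} \sim \mathcal{N}(\mathbf{1}, \kappa^2 \mathbf{I})$ and resampled every step, the network uses the common NTK-style parameterization with fixed $\pm 1$ output weights and a $1/\sqrt{m}$ scaling, and the function name, data, sizes, and step size are all hypothetical choices made for the example.

```python
import numpy as np

def train_masked_two_layer(kappa, n=100, d=20, m=256, steps=300, lr=0.1, seed=0):
    """Full-batch gradient descent on a two-layer ReLU network whose inputs are
    multiplied by a fresh Gaussian mask at every step (all sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm synthetic inputs
    y = rng.uniform(-1.0, 1.0, size=n)              # bounded synthetic labels
    W = rng.standard_normal((m, d))                 # trained first-layer weights
    a = rng.choice([-1.0, 1.0], size=m)             # fixed output weights (assumed)

    def forward(inputs, weights):
        return np.maximum(inputs @ weights.T, 0.0) @ a / np.sqrt(m)

    losses = []
    for _ in range(steps):
        # Track the squared loss on the clean (unmasked) inputs.
        losses.append(0.5 * np.sum((forward(X, W) - y) ** 2))
        C = 1.0 + kappa * rng.standard_normal((n, d))  # mask c ~ N(1, kappa^2 I)
        Xm = X * C                                     # masked inputs x ⊙ c
        pre = Xm @ W.T                                 # pre-activations, shape (n, m)
        resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
        act = (pre > 0.0).astype(float)                # ReLU derivative
        # Gradient of 0.5 * sum(resid^2) with respect to W, shape (m, d).
        grad = (act * resid[:, None] * a[None, :]).T @ Xm / np.sqrt(m)
        W -= lr * grad
    losses.append(0.5 * np.sum((forward(X, W) - y) ** 2))
    return np.array(losses)

clean = train_masked_two_layer(kappa=0.0)
noisy = train_masked_two_layer(kappa=0.2)
print("kappa=0.0: first/last loss", clean[0], clean[-1])
print("kappa=0.2: first/last loss", noisy[0], noisy[-1])
```

Comparing the two loss curves on a log scale is a quick way to look for the qualitative behavior the abstract describes: a fast initial decrease, followed by a plateau whose height grows with $\kappa$.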

Paper Structure (33 sections, 35 theorems, 381 equations, 10 figures)

Key Result

Theorem 4.2

Let ${\mathbf{u}}_{i,r} = {\mathbf{w}}_r\odot{\mathbf{x}}_i$. Define the smoothed activation and neural network as: Let $\phi_{\max},\psi_{\max}, B_y, R_{{\mathbf{u}}}$, and $R_{{\mathbf{w}}}$ be defined in Definition \ref{defin:thm_quantities}. If $B_y \leq 3\sqrt{m} R_{{\mathbf{w}}}$, then we have that: with the magnitude of $\mathcal{E}$ bounded by:

Figures (10)

  • Figure 1: (a) Effect of the noise standard deviation $\kappa$ on the shape of the smoothed activation function $\hat{\sigma}(z; \kappa) = z \cdot \Phi_1(z / (\kappa \|\mathbf{w}\odot\mathbf{x}\|_2))$, where $z = \mathbf{w}^\top\mathbf{x}$. For this visualization, $\|\mathbf{w}\odot\mathbf{x}\|_2$ is held constant at $1.0$. As $\kappa$ increases, the activation becomes progressively smoother compared to the standard ReLU (dotted black line). For small $\kappa$ (e.g., $\kappa=0.01$), $\hat{\sigma}$ closely approximates the standard ReLU. (b) Theoretical smoothed activation $\hat{\sigma}(\mathbf{w}, \mathbf{x})$ versus its empirical estimate $\mathbb{E}_{\mathbf{c}}[\sigma(\mathbf{w}^\top(\mathbf{x} \odot \mathbf{c}))]$ for a fixed pre-activation value $\mathbf{w}^\top\mathbf{x} \approx 0.77$ (actual value depends on fixed $\mathbf{w},\mathbf{x}$) as the noise standard deviation $\kappa$ varies. The close match across a range of relatively small $\kappa$ values validates the theoretical model for $\hat{\sigma}$. The same behavior is observed empirically for other choices of $\mathbf{w},\mathbf{x}$.
  • Figure 2: Smoothed ReLU under multiplicative Gaussian input masking for fixed $\kappa=0.2$, where $z=\mathbf{w}^\top\mathbf{x}$ and $\sigma=\kappa\|\mathbf{w}\odot\mathbf{x}\|_2$. (a) Exact closed-form expectation $\tilde{\sigma}(\mathbf{w},\mathbf{x})=\mathbb{E}_{\mathbf{c}}[\sigma(\mathbf{w}^\top(\mathbf{x}\odot\mathbf{c}))]=z \Phi(z/\sigma)+\sigma \varphi(z/\sigma)$ (as shown in Eq. \ref{eq:expect_act}) matches the Monte Carlo estimate. (b) Proxy smoothed activation $\hat{\sigma}(\mathbf{w},\mathbf{x})=z \Phi(z/\sigma)$ (used in Theorem \ref{thm:exp_loss_approx}) differs mainly near $z\approx 0$ due to the missing $\sigma \varphi(z/\sigma)$ term.
  • Figure 3: (a) Training loss $\mathcal{L}(\mathbf{W}_k)$ (log-scale) versus training iteration for a two-layer ReLU network ($n=500, d=20, m=100$) trained with full-batch gradient descent under different values of the multiplicative Gaussian input noise standard deviation $\kappa$. (b) Distributed training with Gaussian masking for different values of $\kappa$ and different numbers of local steps.
  • Figure 4: Test accuracy versus multiplicative Gaussian noise strength $\kappa$ for (a) a 1-hidden-layer MLP and (b) a CNN, trained on CIFAR-10. Small noise levels ($\kappa \approx 0.2$) can improve generalization for the MLP, likely due to regularization effects. In contrast, the CNN exhibits robustness by maintaining baseline accuracy. Beyond this point, accuracy degrades monotonically for both architectures as noise corrupts the training signal.
  • Figure 5: Attack AUC on the target model when it is (a) an MLP and (b) a CNN. Higher values indicate greater privacy leakage. Training with multiplicative Gaussian (MG) noise (larger $\kappa$) consistently reduces attack success.
  • ...and 5 more figures
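
The closed-form expressions quoted in the captions of Figures 1 and 2 can be checked numerically with a short Monte Carlo experiment. The sketch below assumes the mask is $\mathbf{c} \sim \mathcal{N}(\mathbf{1}, \kappa^2 \mathbf{I})$, so that the masked pre-activation $\mathbf{w}^\top(\mathbf{x}\odot\mathbf{c})$ is Gaussian with mean $z=\mathbf{w}^\top\mathbf{x}$ and standard deviation $\kappa\|\mathbf{w}\odot\mathbf{x}\|_2$. The function names, the random $\mathbf{w},\mathbf{x}$, and the sample count are hypothetical choices for illustration; the standard deviation is called `s` in the code to avoid clashing with the activation $\sigma$.

```python
import math
import numpy as np

def std_normal_cdf(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def std_normal_pdf(t):
    # Standard normal density.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def exact_smoothed_relu(w, x, kappa):
    # Exact expectation from the Figure 2 caption: z*Phi(z/s) + s*phi(z/s).
    z = float(w @ x)
    s = kappa * float(np.linalg.norm(w * x))
    return z * std_normal_cdf(z / s) + s * std_normal_pdf(z / s)

def proxy_smoothed_relu(w, x, kappa):
    # Proxy activation from the captions: z*Phi(z/s), without the s*phi(z/s) term.
    z = float(w @ x)
    s = kappa * float(np.linalg.norm(w * x))
    return z * std_normal_cdf(z / s)

def monte_carlo_relu(w, x, kappa, num_samples, rng):
    # Empirical mean of ReLU(w^T (x ⊙ c)) with c ~ N(1, kappa^2 I) (assumed mask law).
    c = 1.0 + kappa * rng.standard_normal((num_samples, x.shape[0]))
    pre = (x * c) @ w
    return float(np.maximum(pre, 0.0).mean())

rng = np.random.default_rng(0)
d = 20
w = rng.standard_normal(d) / math.sqrt(d)
x = rng.standard_normal(d)
kappa = 0.2

exact = exact_smoothed_relu(w, x, kappa)
proxy = proxy_smoothed_relu(w, x, kappa)
mc = monte_carlo_relu(w, x, kappa, 200_000, rng)
print("z =", float(w @ x))
print("exact =", exact, "proxy =", proxy, "monte carlo =", mc)
```

The gap `exact - proxy` equals $\sigma\varphi(z/\sigma)$, which is largest when $z$ is near zero and shrinks quickly as $|z|/\sigma$ grows, in line with the Figure 2(b) description.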

Theorems & Definitions (73)

  • Definition 4.1
  • Theorem 4.2
  • Remark 4.3
  • Remark 4.4
  • Remark 4.5
  • Remark 4.6
  • Lemma 4.7
  • Lemma 4.8
  • Theorem 4.9
  • Remark 4.10
  • ...and 63 more