A Combinatorial Theory of Dropout: Subnetworks, Graph Geometry, and Generalization

Sahil Rajesh Dhayalkar

TL;DR

The paper introduces a combinatorial, graph-based theory of dropout by treating the space of dropout subnetworks as a $d$-dimensional hypercube and dropout as a random walk over this subnetwork graph. It defines a subnetwork contribution score $C(f)$ to quantify generalization and proves that good subnetworks form dense, low-resistance clusters, with their number growing exponentially with network width. Theoretical results span mean-field behavior in linear regimes, ensemble compression, smoothness of the generalization landscape, PAC-Bayes generalization bounds, and connections between generalization and graph resistance. Extensive experiments across MNIST and CIFAR-10 validate the claims, showing dropout implicitly samples from a structured, robust ensemble of subnetworks and highlighting potential directions for mask-guided regularization and subnetwork-aware optimization.

Abstract

We propose a combinatorial and graph-theoretic theory of dropout by modeling training as a random walk over a high-dimensional graph of binary subnetworks. Each node represents a masked version of the network, and dropout induces stochastic traversal across this space. We define a subnetwork contribution score that quantifies generalization and show that it varies smoothly over the graph. Using tools from spectral graph theory, PAC-Bayes analysis, and combinatorics, we prove that generalizing subnetworks form large, connected, low-resistance clusters, and that their number grows exponentially with network width. This reveals dropout as a mechanism for sampling from a robust, structured ensemble of well-generalizing subnetworks with built-in redundancy. Extensive experiments validate every theoretical claim across diverse architectures. Together, our results offer a unified foundation for understanding dropout and suggest new directions for mask-guided regularization and subnetwork optimization.
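The graph described above can be made concrete with a small sketch. It assumes, as a simplification, that each dropout mask is a binary vector in $\{0,1\}^d$, that two masks are adjacent when they differ in exactly one bit, and that one step of the walk flips a single uniformly chosen bit. The helper names `neighbors` and `random_walk` are hypothetical and do not come from the paper; standard dropout resamples the whole mask at each step, so this illustrates only the graph structure, not the training procedure.

```python
import numpy as np

def neighbors(mask):
    """Return all masks at Hamming distance 1 from `mask` (its hypercube neighbors)."""
    out = []
    for i in range(len(mask)):
        flipped = mask.copy()
        flipped[i] = 1 - flipped[i]  # flip one unit on or off
        out.append(flipped)
    return out

def random_walk(d, steps, rng):
    """Simple random walk on the d-dimensional hypercube of masks.

    Assumption: each step flips one uniformly chosen bit. This is an
    illustrative model, not the update rule of any dropout implementation.
    """
    mask = rng.integers(0, 2, size=d)  # uniformly random starting subnetwork
    path = [mask.copy()]
    for _ in range(steps):
        i = rng.integers(d)
        mask = mask.copy()
        mask[i] = 1 - mask[i]
        path.append(mask)
    return path

rng = np.random.default_rng(0)
d = 4
start = np.array([1, 0, 1, 1])
nbrs = neighbors(start)            # d neighbors, one per unit
path = random_walk(d, steps=5, rng=rng)
num_subnetworks = 2 ** d           # total number of nodes in the graph
```

Each node has exactly $d$ neighbors and the graph has $2^d$ nodes, which is why exhaustive enumeration is only feasible for very small $d$ and sampling is needed otherwise.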

Paper Structure

This paper contains 35 sections, 39 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Histogram of output differences between dropout-averaged output and scaled model output for Lemma 1. The differences are tightly concentrated around $0.01$, validating the lemma.
  • Figure 2: Distribution of squared norms $\|\theta \odot M\|^2$ over 10,000 random dropout masks. The red dashed line indicates the expected value $p \cdot \|\theta\|^2$.
  • Figure 3: Fraction of subnetworks in $\mathcal{G}_\epsilon$ for different thresholds $\epsilon$.
  • Figure 4: Contribution Scores versus Frequency for 1000 sampled subnetworks.
  • Figure 5: Scatter plots of effective resistance vs contribution score differences across 1-bit-flip subnetworks, for each model-dataset configuration.
  • ...and 1 more figure
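The expected value marked in Figure 2 can be checked numerically. The sketch below assumes each mask entry $M_i$ is an independent Bernoulli variable with keep probability $p$. Because $M_i \in \{0,1\}$ implies $M_i^2 = M_i$, it follows that $\mathbb{E}\|\theta \odot M\|^2 = \sum_i \theta_i^2\, \mathbb{E}[M_i] = p \cdot \|\theta\|^2$. The dimension, the value of $p$, and the Gaussian $\theta$ are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions, not taken from the paper).
d, p, n_masks = 1000, 0.5, 10_000

theta = rng.normal(size=d)               # stand-in parameter vector
masks = rng.random((n_masks, d)) < p     # each entry kept with probability p

# Squared norm of the masked parameters, one value per sampled mask.
sq_norms = ((theta * masks) ** 2).sum(axis=1)

expected = p * np.sum(theta ** 2)        # p * ||theta||^2
empirical = sq_norms.mean()
rel_error = abs(empirical - expected) / expected
```

With this many samples the empirical mean should sit very close to $p \cdot \|\theta\|^2$, while individual values of `sq_norms` scatter around it, which is the shape a histogram such as the one in Figure 2 would show.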