What to Do When Your Discrete Optimization Is the Size of a Neural Network?

Hugo Silva, Martha White

TL;DR

The paper studies discrete optimization problems that arise in neural networks, framed as pseudo-Boolean (PB) optimization. It contrasts continuation-path (CP) methods with Monte Carlo (MC) gradient estimators and also considers hybrids of the two. It formalizes PB objectives, surveys CP and MC approaches, and analyzes their theoretical limits, including the hazards of extrapolating from outside the solution set and the dependence of MC estimates on the current sampling distribution. In microworld and large neural-network experiments (including pruning and masked-network tasks), CP methods can struggle on small-scale problems but become practically useful with large, overparameterized networks, while MC methods are hampered by dimensionality and unwanted generalization and often fail to outperform CP in pruning scenarios. Neither CP nor MC alone offers a universal solution for PB optimization in modern neural networks, so future work should focus on problem- and architecture-aware designs and refined hybrids.

Abstract

Machine learning applications using neural networks often involve solving discrete optimization problems, such as pruning, parameter-isolation-based continual learning, and training of binary networks. These discrete problems are combinatorial in nature and not amenable to gradient-based optimization. Additionally, classical approaches used in discrete settings do not scale well to large neural networks, forcing scientists and empiricists to rely on alternative methods. Among these, two main distinct sources of top-down information can be used to lead the model to good solutions: (1) extrapolating gradient information from points outside of the solution set, and (2) comparing evaluations between members of a subset of the valid solutions. We take continuation path (CP) methods to represent using purely the former and Monte Carlo (MC) methods to represent the latter, while also noting that some hybrid methods combine the two. The main goal of this work is to compare both approaches. To that end, we first give an overview of the two classes and discuss some of their drawbacks analytically. Then, in the experimental section, we compare their performance, starting with smaller microworld experiments, which allow more fine-grained control of problem variables, and gradually moving towards larger problems, including neural network regression and neural network pruning for image classification, where we additionally compare against magnitude-based pruning.
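The two information sources named in the abstract can be sketched on a toy problem. The block below is a minimal illustration under stated assumptions, not the paper's implementation: the objective `J`, its `target` vector, and the function names `relaxed_gradient` and `score_function_gradient` are all hypothetical. The first function differentiates `J` at a sigmoid-relaxed point outside $\{0,1\}^d$, the kind of gradient a CP-style method would follow while annealing a temperature. The second uses the standard score-function (REINFORCE) estimator, which evaluates `J` only at sampled binary solutions.

```python
import math
import random


def J(x):
    # Hypothetical toy objective (an assumption, not from the paper). It is
    # defined on all of [0,1]^d, so it can be evaluated off the solution set.
    target = [1.0, 0.0, 1.0]
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def sigmoid(s, temperature=1.0):
    # Numerically stable sigmoid of s / temperature.
    u = s / temperature
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def relaxed_gradient(scores, temperature=1.0, eps=1e-5):
    """Source (1): gradient of J at the relaxed point sigmoid(scores / temperature),
    taken with central finite differences. The point lies outside {0,1}^d."""
    grad = []
    for i in range(len(scores)):
        up, down = list(scores), list(scores)
        up[i] += eps
        down[i] -= eps
        j_up = J([sigmoid(s, temperature) for s in up])
        j_down = J([sigmoid(s, temperature) for s in down])
        grad.append((j_up - j_down) / (2 * eps))
    return grad


def score_function_gradient(theta, n_samples, rng):
    """Source (2): score-function (REINFORCE) estimate of d/dtheta E[J(z)],
    with z_i ~ Bernoulli(theta_i) independent. Requires 0 < theta_i < 1.
    J is only ever evaluated at valid binary solutions."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(n_samples):
        z = [1.0 if rng.random() < t else 0.0 for t in theta]
        jz = J(z)
        for i in range(d):
            # d/dtheta_i log p(z) = (z_i - theta_i) / (theta_i * (1 - theta_i))
            grad[i] += jz * (z[i] - theta[i]) / (theta[i] * (1.0 - theta[i]))
    return [g / n_samples for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    print("relaxed gradient:", relaxed_gradient([0.0, 0.0, 0.0]))
    print("score-function gradient:",
          score_function_gradient([0.5, 0.5, 0.5], 20000, rng))
```

Note that the two gradients are taken with respect to different parameters (sigmoid scores versus Bernoulli probabilities), so only their signs are directly comparable here. For this separable toy objective the exact gradient of $\mathbb{E}[J(\textbf{z})]$ with respect to $\boldsymbol{\theta}$ is $(-1, 1, -1)$, which the sampled estimate approaches only as the number of samples grows.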

Paper Structure (63 sections, 11 theorems, 143 equations, 20 figures, 7 tables, 6 algorithms)

This paper contains 63 sections, 11 theorems, 143 equations, 20 figures, 7 tables, 6 algorithms.

Key Result

Theorem 1

Assuming a vector $\boldsymbol{\theta} \in [0,1]^d$ and considering $\textbf{z} \in \{0,1\}^d$ such that $z_i \sim \mathrm{Ber}(\theta_i)$, with $z_i$ and $z_j$ independent for $i \neq j$, and with $\mathscr{P}_J(\cdot)$ as defined in the section on PB basics (sec:pb_basics), we have:

Figures (20)

  • Figure 1: Effect of varying the temperature of a sigmoid.
  • Figure 2: Completely different choices of $J(\cdot)$ lead to the same PB optimization problem.
  • Figure 3: $J(\cdot)$ for the counter examples.
  • Figure 4: Performance of different learning rates ($\alpha$) for the problem in the example labeled ex:deceiving_piecewise.
  • Figure 5: Contour plots of $\mathbb{E}[J(\cdot)]$ for $d=2$; blue regions correspond to lower values. Left (a,b): Arrows indicate the summands from the equation indicated in the caption. If $p_{\boldsymbol{\theta}}(\boldsymbol{\zeta}_2)$ is low, the gradient may not point towards $\textbf{z}^* = \boldsymbol{\zeta}_2$, such as in (a). Right (c): Illustration for the example labeled ex:mc_bad_gen_1. Small arrows correspond to the gradient field. Darker arrows correspond to lower gradient magnitudes.
  • ...and 15 more figures
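
The captions of Figures 1 and 2 can be checked numerically. The sketch below is illustrative only and assumes hypothetical names (`tempered_sigmoid`, `J_linear`, `J_squared`) that do not come from the paper. It shows that a sigmoid approaches a hard 0/1 step as its temperature shrinks, and that two objectives can agree on every binary point while differing in the interior of $[0,1]^d$, so they pose the same PB problem but give different relaxations.

```python
import itertools
import math


def tempered_sigmoid(s, temperature):
    # Numerically stable sigmoid of s / temperature; as the temperature
    # decreases, the output approaches a hard step at s = 0.
    u = s / temperature
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def J_linear(x):
    # Hypothetical objective: sum of the coordinates.
    return sum(x)


def J_squared(x):
    # Hypothetical objective: sum of squared coordinates. Since 0**2 == 0 and
    # 1**2 == 1, it matches J_linear on every point of {0,1}^d.
    return sum(xi * xi for xi in x)


if __name__ == "__main__":
    for temperature in (10.0, 1.0, 0.1, 0.01):
        print(temperature, tempered_sigmoid(0.5, temperature))
    same_on_binary = all(
        J_linear(z) == J_squared(z)
        for z in itertools.product([0, 1], repeat=3)
    )
    print("agree on {0,1}^3:", same_on_binary)
    print("interior point:", J_linear([0.5, 0.5]), J_squared([0.5, 0.5]))
```

Because methods that extrapolate from relaxed points see the interior values, the choice between two such objectives can change their behavior even though the discrete problem is unchanged.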

Theorems & Definitions (21)

  • Example 1: Training of Binary Neural Networks
  • Example 2: Neural Network Pruning
  • Example 3: Sequential Task Learning
  • Example 4
  • Example 5
  • Theorem 1: Restatement of Proposition 5 of Boros and Hammer (2002) (boros2002pseudo)
  • Theorem 2
  • Theorem 3
  • Example 6
  • Theorem 4
  • ...and 11 more