XConv: Low-memory stochastic backpropagation for convolutional layers

Anirudh Thatipelli, Jeffrey Sam, Mathias Louboutin, Ali Siahkoohi, Rongrong Wang, Felix J. Herrmann

TL;DR

The paper proposes XConv, a drop-in replacement for standard convolutional layers that preserves standard backpropagation, imposes no architectural constraints, and integrates seamlessly into existing codebases. It achieves performance comparable to exact gradient methods across classification, generative modeling, super-resolution, inpainting, and segmentation.

Abstract

Training convolutional neural networks at scale demands substantial memory, largely due to storing intermediate activations for backpropagation. Existing approaches -- such as checkpointing, invertible architectures, or gradient approximation methods like randomized automatic differentiation -- either incur significant computational overhead, impose architectural constraints, or require non-trivial codebase modifications. We propose XConv, a drop-in replacement for standard convolutional layers that addresses all three limitations: it preserves standard backpropagation, imposes no architectural constraints, and integrates seamlessly into existing codebases. XConv exploits the algebraic structure of convolutional layer gradients, storing highly compressed activations and approximating weight gradients via multi-channel randomized trace estimation. We establish convergence guarantees and derive error bounds for the proposed estimator, showing that the variance of the resulting gradient errors is comparable to that of stochastic gradient descent. Empirically, XConv achieves performance comparable to exact gradient methods across classification, generative modeling, super-resolution, inpainting, and segmentation -- with gaps that narrow as the number of probing vectors increases -- while reducing memory usage by a factor of two or more and remaining computationally competitive with optimized convolution implementations.
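
To make the idea of randomized trace estimation concrete, the following is a minimal NumPy sketch of the generic Gaussian (Hutchinson-style) estimator, together with a toy illustration of why probing allows a compressed sketch to be stored in place of a full activation. It is not the authors' implementation and does not reproduce the multi-channel scheme or the convolution-specific structure; the function names `hutchinson_trace` and `probed_inner_product`, and all shapes, are assumptions made for illustration only.

```python
import numpy as np


def hutchinson_trace(A, r, rng):
    """Estimate tr(A) from r i.i.d. standard Gaussian probing vectors.

    Uses tr(A) = E[z^T A z] for z ~ N(0, I), averaged over r samples.
    """
    N = A.shape[0]
    Z = rng.standard_normal((N, r))   # columns are the probing vectors
    return np.sum(Z * (A @ Z)) / r    # (1/r) * sum_i z_i^T A z_i


def probed_inner_product(X, Y, r, rng):
    """Estimate tr(X^T Y) while keeping only an r x d sketch of X.

    Hypothetical illustration: the sketch Z^T X stands in for the
    N x d matrix X, which could then be discarded. In practice Z can
    be regenerated from a stored seed rather than kept in memory.
    """
    N = X.shape[0]
    Z = rng.standard_normal((N, r))
    X_sketch = Z.T @ X                # r x d, stored instead of X (N x d)
    Y_probe = Z.T @ Y                 # formed later, when Y is available
    return np.sum(X_sketch * Y_probe) / r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((64, 64))
    A = B @ B.T / 64
    for r in (16, 128, 1024):
        print(r, hutchinson_trace(A, r, rng), np.trace(A))
```

The estimator is unbiased, and its error shrinks as the number of probing vectors `r` grows, which matches the abstract's observation that the gap to exact gradients narrows with more probing vectors.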

Paper Structure

This paper contains 41 sections, 7 theorems, 40 equations, 26 figures, 5 tables.

Key Result

Proposition 1

Let $\mathbf{A} \in \mathbb{R}^{N \times N}$ be a square matrix and let the probing vectors be i.i.d. Gaussian with $0$ mean and unit variance. Then for any small number $\delta >0$, with probability $1-\delta$, we have

Figures (26)

  • Figure 1: Multi-channel probing illustration and effect of orthogonalization. (a) Schematic of the three-step trace estimation procedure for a single sub-block. (b) Probing matrices $\mathbf{Z}$ and their Gram matrices $\mathbf{Z}^\top\mathbf{Z}$ before (left) and after (right) block orthogonalization, showing reduced cross-channel interference.
  • Figure 2: XConv can be integrated either by directly instantiating XConv layers (left) or by converting existing convolutional models using a single API call (right).
  • Figure 3: Gradient analysis on CIFAR-10. (a) Per-weight gradient estimates for four convolutional layers under different probing strategies. (b) Standard deviation of gradients across 40 mini-batches for varying batch sizes and probing vectors $r$.
  • Figure 4: Average gradient error vs. image dimension for SqueezeNet. Results are shown for probing vectors $r \in \{16, 128\}$ over $10$ runs. Bold numbers indicate the maximum batch size that fits in memory for each method (blue: standard convolution, red: XConv) under a fixed $16$ GB memory budget. Across all probing-vector settings, XConv consistently permits larger batch sizes than standard convolution. The average gradient error (AGE) is higher for XConv, but the gap shrinks progressively with increasing image dimension $N$, becoming small at $N = 1024$ for $r = 128$ probing vectors.
  • Figure 5: Average gradient error vs. image dimension for U-Net. Results are shown for probing vectors (a) $r = 16$ and (b) $r = 128$, over 10 runs. Numbers at each data point indicate the maximum batch size that fits in memory for each method (blue: standard convolution, red: XConv) under a fixed memory budget. XConv increases AGE relative to Conv, but remains within one order of magnitude across all image dimensions. The gap narrows as $r$ increases, indicating that approximation noise can be controlled by adjusting the number of probing vectors.
  • ...and 21 more figures

Theorems & Definitions (11)

  • Proposition 1
  • Theorem 1: Succinct version
  • Proposition 2
  • Lemma 2: Theorem 5 of cortinovis2020randomized
  • Proof: Proof of Proposition \ref{pro:A}
  • Lemma 3
  • Proof: Proof of Lemma \ref{lm:off}
  • Lemma 4
  • proof
  • Theorem 1
  • ...and 1 more