Efficient Sparse Training with Structured Dropout

Andy Lo

TL;DR

Empirically, SparseDrop regularises about as well as standard dropout, and sometimes better, which suggests it could serve as a drop-in replacement for standard dropout with faster training speeds.

Abstract

Dropout is a common regularisation technique in deep learning that improves generalisation. Although it introduces sparsity, and thus the potential for higher throughput, it usually cannot deliver speed-ups on GPUs because the sparsity is unstructured. In this project, I experiment with SparseDrop, a structured, hardware-friendly variant of dropout that can exploit such sparsity. I provide a CUDA implementation of SparseDrop that achieves speed-ups over its dense counterpart even at low sparsity levels. The empirical results demonstrate that SparseDrop provides regularisation properties similar to, or sometimes even better than, those of standard dropout. This suggests its potential as a drop-in replacement for standard dropout with faster training speeds. The source code is available at https://github.com/andylolu2/sparse-dropout
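To make the contrast between unstructured and structured dropout concrete, the following is a minimal pure-Python sketch of block-wise dropout. It is an illustration only, not the CUDA implementation from the repository: the function names `block_dropout_mask` and `apply_block_dropout` are hypothetical, and the rescaling of kept entries by $1 / (1 - p)$ is the usual inverted-dropout convention, assumed here rather than taken from the source.

```python
import random


def block_dropout_mask(n_rows, n_cols, blk_r, blk_c, p, rng):
    """Return a keep-mask holding one boolean per (blk_r x blk_c) block.

    Hypothetical helper: each block is kept independently with
    probability 1 - p, so sparsity appears in whole blocks rather
    than in scattered individual elements.
    """
    assert n_rows % blk_r == 0 and n_cols % blk_c == 0
    return [
        [rng.random() >= p for _ in range(n_cols // blk_c)]
        for _ in range(n_rows // blk_r)
    ]


def apply_block_dropout(x, keep, blk_r, blk_c, p):
    """Zero whole blocks of x and rescale kept entries by 1 / (1 - p).

    The rescaling follows standard inverted dropout (an assumption
    for this sketch) and requires p < 1.
    """
    scale = 1.0 / (1.0 - p)
    return [
        [
            x[i][j] * scale if keep[i // blk_r][j // blk_c] else 0.0
            for j in range(len(x[0]))
        ]
        for i in range(len(x))
    ]


rng = random.Random(0)
x = [[1.0] * 4 for _ in range(4)]  # 4 x 4 activations
keep = block_dropout_mask(4, 4, 2, 2, p=0.5, rng=rng)  # 2 x 2 block mask
y = apply_block_dropout(x, keep, 2, 2, p=0.5)
```

Because every element of a block shares one mask bit, a kernel that processes the matrix block by block can skip a dropped block entirely, which is not possible when zeros are scattered element-wise.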

Paper Structure

This paper contains 27 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Standard versus block-wise sparse GEMM for one $M$-block with $M_{blk} = N_{blk} = K_{blk} = 2$. The blue matrix represents a fragment of input $\mathbf{A}$, the green matrix is input $\mathbf{B}$ and the red matrix is the output $\mathbf{C}$. Solid lines represent the block-wise access granularity by the GPU. Ignored elements are coloured in gray. The sparsity level is 33% in all cases but only the structured variants can benefit from it by skipping over entire blocks.
  • Figure 2: Example of block splitting. The logical block size is $2 \times 2$, but mask (b) operates on $2 \times 1$ blocks while mask (c) operates on $1 \times 2$ blocks. The semantics are the same across all three masks.
  • Figure 3: Benchmark of SparseDrop against baseline methods, measured on an RTX 2060 Max-Q with the GPU clock locked at 1200 MHz and the L2 cache cleared between measurements.
  • Figure 4: Total time (forward + backward) of models at various sparsity levels. Measured on an RTX 2060 Max-Q (GPU clock locked at 1200 MHz) with automatic mixed precision.
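
The block-skipping idea described in the Figure 1 caption can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the author's kernel: `matmul_block_sparse` is a hypothetical name, the mask is applied to blocks of $\mathbf{A}$ only, and the output is not tiled along $N$ as a real GPU kernel would do. The example uses one $M$-block with three $K$-blocks, one of which is dropped, giving the 33% sparsity mentioned in the caption.

```python
def matmul_dense(a, b):
    """Reference dense GEMM: C = A @ B over nested lists."""
    k, n = len(b), len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(k)) for j in range(n)]
        for row in a
    ]


def matmul_block_sparse(a, b, keep, m_blk, k_blk):
    """Compute (masked A) @ B, skipping dropped (m_blk x k_blk) blocks of A.

    keep[bi][bk] is False for a dropped block. Returns the product and
    the number of blocks actually visited.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    visited = 0
    for bi in range(m // m_blk):
        for bk in range(k // k_blk):
            if not keep[bi][bk]:
                continue  # whole block skipped: no loads, no multiplies
            visited += 1
            for i in range(bi * m_blk, (bi + 1) * m_blk):
                for t in range(bk * k_blk, (bk + 1) * k_blk):
                    a_it = a[i][t]
                    for j in range(n):
                        c[i][j] += a_it * b[t][j]
    return c, visited


a = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]]          # 2 x 6: one M-block
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0],
     [1.0, 2.0], [3.0, 0.0], [0.0, 3.0]]          # 6 x 2
keep = [[True, False, True]]                      # middle K-block dropped

c, visited = matmul_block_sparse(a, b, keep, m_blk=2, k_blk=2)

# Dense reference with the dropped block zeroed element by element.
a_masked = [
    [v if keep[i // 2][t // 2] else 0.0 for t, v in enumerate(row)]
    for i, row in enumerate(a)
]
```

The sparse routine produces the same result as the dense product over the masked input while touching only two of the three blocks; an element-wise mask with the same fraction of zeros would still require visiting every block.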