Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks

Changwoo Lee, Hun-Seok Kim

TL;DR

This paper proposes a generalized, differentiable framework that learns efficient weight-matrix structures by gradient descent, using a parameterization based on the Gaussian-Dirichlet kernel. It also defines a new class of structured matrices that, by adjusting its structural parameters, covers a wide range of structured matrices from the literature.

Abstract

This paper investigates efficient deep neural networks (DNNs) in which dense, unstructured weight matrices are replaced with structured ones that possess desired properties. The challenge is that the optimal weight-matrix structure in popular neural network models is unknown in most cases and may vary from layer to layer even within the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted, with no generalized framework for learning them systematically. To address this issue, we propose a generalized and differentiable framework that learns efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting the structural parameters. Then, a frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to learn the structural parameters by proximal gradient descent. On image and language tasks, our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices.

Paper Structure (36 sections, 8 theorems, 29 equations, 8 figures, 5 tables, 2 algorithms)

Key Result

Theorem 1

Let $n,K,s$ be positive integers satisfying $Ks\ge n$. Then any $n$-by-$n$ rank-$\frac{Ks}{n}$ matrices and $(n,\frac{K}{s},s)$-block-sparse matrices are $(n,K,s)$-GBLR. Also, any $(n,K,s,1)$-block-low-rank matrices are $(n,K,s)$-GBLR if $K=(n/s)^2$.
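
The low-rank and block-sparse cases of Theorem 1 can be illustrated on a tiny example. The sketch below assumes, following the Figure 2 description, that a GBLR matrix is a sum of rank-one blocks, each formed as the outer product of a masked ${\bm{u}}$ and a masked ${\bm{v}}$, where a mask with width $w$ and location $l$ is one on $w$ consecutive indices starting at $l$. The helper names `mask` and `gblr` are hypothetical, the masks here do not wrap around, and the $(n,K,s)$ budget of Definition 1 is not checked.

```python
def mask(n, width, loc):
    """Binary mask of length n: ones on [loc, loc + width), zeros elsewhere."""
    return [1.0 if loc <= i < loc + width else 0.0 for i in range(n)]


def gblr(n, blocks):
    """Sum of masked rank-one blocks (assumed form, see the lead-in).

    Each block is a tuple (row_width, row_loc, col_width, col_loc, u, v).
    """
    W = [[0.0] * n for _ in range(n)]
    for wr, lr, wc, lc, u, v in blocks:
        mr, mc = mask(n, wr, lr), mask(n, wc, lc)
        for i in range(n):
            for j in range(n):
                W[i][j] += (mr[i] * u[i]) * (mc[j] * v[j])
    return W


n = 4

# Low-rank case: two blocks whose masks span every row and column
# give u1 v1^T + u2 v2^T, a matrix of rank at most 2.
u1, v1 = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 0.0]
u2, v2 = [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]
low_rank = gblr(n, [(n, 0, n, 0, u1, v1), (n, 0, n, 0, u2, v2)])

# Block-sparse case: one dense 2x2 block at rows 2..3, columns 0..1,
# written as two width-2 rank-one blocks. Entries of u and v that fall
# outside the masks (the 9.0 values) are zeroed out by the masks.
block_sparse = gblr(n, [
    (2, 2, 2, 0, [9.0, 9.0, 1.0, 0.0], [5.0, 6.0, 9.0, 9.0]),
    (2, 2, 2, 0, [9.0, 9.0, 0.0, 1.0], [7.0, 8.0, 9.0, 9.0]),
])

for row in block_sparse:
    print(row)
```

In both cases only the structural parameters (widths and locations) differ; the content parameters are handled identically, which is what lets a single parameterization cover several matrix families.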

Figures (8)

  • Figure 1: Comparison of block-sparse, block-low-rank, and our proposed Generalized block-low-rank matrices.
  • Figure 2: Left: An example of a GBLR matrix with 4 blocks. A block is generated from the structural parameters $(w^R,l^R),(w^C,l^C)$ and the content parameters $({\bm{u}},{\bm{v}})$, where $(w^R,l^R)$ and $(w^C,l^C)$ form binary masks ${\bm{m}}_{(w^R,l^R)}$ and ${\bm{m}}_{(w^C,l^C)}$, respectively. Note that overlapped regions can have a rank higher than one. Right: Efficient Matrix-Vector Product computations using cropped content parameters and structural parameters. The structural parameters locate the input and output indices/addresses to read and write.
  • Figure 3: Comparison between Boxcar mask and Gaudi masks in the time domain with different smoothing factors $\sigma$. The Gaudi mask converges to the Boxcar mask as $\sigma$ grows.
  • Figure 4: ImageNet accuracy after fine-tuning ViT-Base weights replaced with structured weight matrices. Dense: the original ViT-Base model.
  • Figure 5: Accuracy-Cost trade-off of models trained from scratch on CIFAR-10/100 dataset.
  • ...and 3 more figures
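
The behaviour described in the Figure 3 caption can be reproduced with a small frequency-domain sketch. It assumes the smoothed mask is obtained by multiplying the DFT of a boxcar mask by a Gaussian window $\exp(-k^2/(2\sigma^2))$ over signed frequency indices $k$; the paper's exact Gaudi definition, which also makes the width and location differentiable, is not reproduced here. The function names are hypothetical, and the naive $O(n^2)$ DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def boxcar(n, width, loc):
    """Hard binary mask: ones on [loc, loc + width), zeros elsewhere."""
    return [1.0 if loc <= t < loc + width else 0.0 for t in range(n)]


def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def smoothed_mask(n, width, loc, sigma):
    """Boxcar mask whose spectrum is damped by a Gaussian window (assumed form)."""
    spectrum = dft(boxcar(n, width, loc))
    damped = []
    for k, coeff in enumerate(spectrum):
        f = k if k <= n // 2 else k - n  # signed frequency index
        damped.append(coeff * math.exp(-(f * f) / (2.0 * sigma * sigma)))
    # The window is symmetric in f, so the result is real up to rounding.
    return [z.real for z in idft(damped)]


hard = boxcar(16, 6, 4)
soft = smoothed_mask(16, 6, 4, sigma=1.0)    # heavily smoothed edges
sharp = smoothed_mask(16, 6, 4, sigma=1e6)   # window is nearly flat

print(max(abs(a - b) for a, b in zip(soft, hard)))
print(max(abs(a - b) for a, b in zip(sharp, hard)))
```

As $\sigma$ grows the Gaussian window approaches one at every frequency, so the smoothed mask approaches the boxcar, consistent with the caption.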

Theorems & Definitions (19)

  • Definition 1
  • Theorem 1
  • Theorem 2: Closed under structural interpolation
  • Theorem 3
  • Corollary 4
  • Definition 2: Block-sparse matrix
  • Definition 3: Block-low-rank matrix (Amestoy et al., 2015; Jeannerod et al., 2019)
  • proof
  • proof
  • Lemma 5
  • ...and 9 more