Error Feedback Can Accurately Compress Preconditioners

Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh

TL;DR

The paper tackles the prohibitive memory cost of full-matrix preconditioners in deep learning. It introduces EFCP, an error-feedback-based compression scheme that dramatically reduces the gradient-history memory through sparsity or low-rank representations, while preserving convergence properties. The authors provide algorithmic, data-structure, and CUDA-kernel innovations (including a dynamic sparse ring-buffer) and offer partial theoretical guarantees, complemented by extensive experiments on ImageNet-ResNet and BERT-GLUE showing memory reductions with negligible or no accuracy loss. This approach enables practical usage of full-matrix preconditioners on single-GPU hardware, unlocking the benefits of second-order updates for large-scale models.

Abstract

Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning. Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC), suffer from massive storage costs when applied even to small-scale models, as they must store a sliding window of gradients, whose memory requirements are multiplicative in the model dimension. In this paper, we address this issue via a novel and efficient error-feedback technique that can be applied to compress preconditioners by up to two orders of magnitude in practice, without loss of convergence. Specifically, our approach compresses the gradient information via sparsification or low-rank compression before it is fed into the preconditioner, feeding the compression error back into future iterations. Experiments on deep neural networks show that this approach can compress full-matrix preconditioners to up to 99% sparsity without accuracy loss, effectively removing the memory overhead of full-matrix preconditioners such as GGT and M-FAC. Our code is available at https://github.com/IST-DASLab/EFCP.
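
The compress-then-feed-back step described in the abstract can be illustrated with a minimal sketch. It assumes Top-K sparsification as the compressor and uses plain Python lists; the function names `top_k` and `ef_compress` are hypothetical and are not taken from the authors' code. At each step the previous error $\xi_{t-1}$ is added to the gradient $g_t$, only the k largest-magnitude entries are kept for the preconditioner, and the remainder becomes the new error $\xi_t$.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def ef_compress(grad, error, k):
    """One error-feedback step: returns (compressed gradient, new error)."""
    acc = [g + e for g, e in zip(grad, error)]      # a_t = g_t + xi_{t-1}
    comp = top_k(acc, k)                            # c_t = TopK(a_t)
    new_error = [a - c for a, c in zip(acc, comp)]  # xi_t = a_t - c_t
    return comp, new_error


# Toy run: two 4-dimensional gradients, keeping one entry per step.
error = [0.0, 0.0, 0.0, 0.0]
grads = [[4.0, 1.0, -2.0, 0.5], [0.5, 1.0, -2.0, 0.5]]
sent = []  # what a preconditioner would receive in place of the dense gradients
for g in grads:
    c, error = ef_compress(g, error, k=1)
    sent.append(c)
```

In this toy run the entry at index 2 is dropped in the first step but carried in the error and sent in the second step with its accumulated value. By construction, the compressed vectors plus the final error sum to exactly the sum of the original gradients, so no gradient mass is discarded, only delayed.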

Paper Structure

This paper contains 66 sections, 2 theorems, 16 equations, 5 figures, 12 tables, 5 algorithms.

Key Result

Lemma 11.1

Given a PSD matrix $H$, let the operator $T_{H}$ be defined as … Then it always holds that …

Figures (5)

  • Figure 1: Validation (Top-1) accuracy for ResNet18 training from scratch on ImageNet-1K. M-FAC and 99%-Sparse M-FAC reach approximately 1% higher validation accuracy relative to SGD using the well-tuned FFCV recipe.
  • Figure 2: Training loss and validation (Top-1) accuracy for ResNet18 training from scratch on ImageNet-1K under a detailed sparsity sweep, between $1\%$ and $0.04\%$ density. Dense M-FAC outperforms SGD in this setting by about $1\%$, which is recovered by S-MFAC at 1% density. Even with $0.17\%$ gradient density, M-FAC slightly outperforms SGD in terms of training loss and test accuracy.
  • Figure 3: Norms of the error feedback $||\xi_t||_2$, norm of the gradient $||g_t||_2$ and the EF metric value for BERT-Base.
  • Figure 4: Norms of the error feedback $||\xi_t||_2$, norm of the gradient $||g_t||_2$ and the EF metric value for ResNet-20.
  • Figure 5: Comparison between S-MFAC and Shampoo on ResNet-18 / ImageNet

Theorems & Definitions (2)

  • Lemma 11.1
  • Lemma 11.2