Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

Viktoriia Chekalina, Daniil Moskovskiy, Daria Cherniuk, Maxim Kurkin, Andrey Kuznetsov, Evgeny Frolov

TL;DR

GFWSVD addresses the high cost of the full Fisher Information Matrix by introducing a scalable Kronecker-factorized approximation and a generalized SVD that captures both diagonal and off-diagonal parameter dependencies. The method generalizes FWSVD by using two non-diagonal sensitivity matrices and derives an optimal rank-$r$ compression via $\widehat{W}_r = L_B^{-\top} \widetilde{W}_r L_A^{-1}$, computed through a rank-1 SVD on a permuted Fisher matrix and Lanczos-based optimization to achieve cubic complexity. Empirically, GFWSVD consistently outperforms diagonal FI-based methods and activation-based baselines on encoder (BERT/GLUE) and decoder (LLaMA-2/MMLU) tasks, especially at low ranks, with notable gains at 20% compression on MMLU (5pp over FWSVD, 3pp over SVD-LLM, 6pp over ASVD). The approach demonstrates practical, task-aware compression that preserves downstream performance while reducing computational demands, and points to future work on higher-rank Kronecker expansions and cross-layer compression strategies.

Abstract

The Fisher information is a fundamental concept for characterizing the sensitivity of parameters in neural networks. However, leveraging the full observed Fisher information is too expensive for large models, so most methods rely on simple diagonal approximations. While efficient, this approach ignores parameter correlations, often resulting in reduced performance on downstream tasks. In this work, we mitigate these limitations and propose Generalized Fisher-Weighted SVD (GFWSVD), a post-training LLM compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix, providing a more accurate reflection of parameter importance. To make the method tractable, we introduce a scalable adaptation of the Kronecker-factored approximation algorithm for the observed Fisher information. We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines. For example, at a 20% compression rate on the MMLU benchmark, our method outperforms FWSVD, which is based on a diagonal approximation of the Fisher information, by 5 percent, SVD-LLM by 3 percent, and ASVD by 6 percent.
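
The TL;DR mentions a rank-1 SVD on a permuted Fisher matrix. A minimal sketch of that idea, assuming it corresponds to the classical nearest-Kronecker-product construction: rearranging the matrix turns `kron(A, B)` into the rank-1 outer product of the flattened factors, so the leading singular pair gives the best Frobenius-norm fit. The function name `nearest_kronecker`, the toy sizes, the row-major flattening, and the random stand-in for gradients are all illustrative assumptions. The sketch uses a dense SVD for readability, not the Lanczos-based procedure the paper describes.

```python
import numpy as np

def nearest_kronecker(F, m, n):
    """Best Frobenius-norm fit of F (mn x mn) by kron(A, B), A: m x m, B: n x n."""
    # Rearrange F so that kron(A, B) maps to the rank-1 matrix vec(A) vec(B)^T
    # (row-major flattening of both factors).
    R = F.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Split the leading singular value evenly between the two factors.
    A = np.sqrt(s[0]) * U[:, 0].reshape(m, m)
    B = np.sqrt(s[0]) * Vt[0].reshape(n, n)
    # Singular vectors are defined up to a joint sign; prefer trace(A) >= 0.
    if np.trace(A) < 0:
        A, B = -A, -B
    return A, B

rng = np.random.default_rng(0)
m, n = 3, 4
# Toy stand-in for an observed Fisher matrix: mean outer product of random
# vectors playing the role of flattened per-sample gradients (hypothetical data).
G = rng.standard_normal((200, m * n))
F = G.T @ G / len(G)

A, B = nearest_kronecker(F, m, n)
rel_err = np.linalg.norm(F - np.kron(A, B)) / np.linalg.norm(F)
print(f"relative approximation error: {rel_err:.3f}")
```

Forming `F` explicitly is only feasible at toy sizes; for a real layer the matrix has (mn)² entries, which is why a matrix-free iterative solver is needed in practice.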

Paper Structure

This paper contains 20 sections, 1 theorem, 32 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathbf{W} \in \mathbb{R}^{n \times m}$ represent the weight matrix of a single-layer linear neural network. Suppose that the following conditions hold. Under these conditions, the best rank-$r$ approximation that minimizes the expected increase in the loss after low-rank decomposition of $\mathbf{W}^{\star}$ is given by: $\widehat{\mathbf{W}}_r = \mathbf{L}_\mathbf{B}^{-\top} \widetilde{\mathbf{W}}_r \mathbf{L}_\mathbf{A}^{-1}$, where $\mathbf{A} = \mathbf{L}_\mathbf{A} \mathbf{L}_{\ldots}$ …
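
The compression formula from the TL;DR, $\widehat{W}_r = L_B^{-\top} \widetilde{W}_r L_A^{-1}$, can be illustrated with a small numerical sketch. The theorem statement above is truncated, so the following rests on assumptions that are not confirmed by the text: $L_A$ and $L_B$ are taken to be Cholesky factors of symmetric positive-definite sensitivity matrices $A$ ($m \times m$) and $B$ ($n \times n$), and $\widetilde{W}_r$ is taken to be the rank-$r$ truncated SVD of $L_B^{\top} W L_A$. The function names and the random test matrices are hypothetical.

```python
import numpy as np

def truncated_svd(M, r):
    """Rank-r truncation of M via its SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def weighted_low_rank(W, A, B, r):
    """Sketch of W_hat = L_B^{-T} W_tilde_r L_A^{-1} (assumed reading of the theorem)."""
    L_A = np.linalg.cholesky(A)              # assumption: A = L_A @ L_A.T
    L_B = np.linalg.cholesky(B)              # assumption: B = L_B @ L_B.T
    W_tilde_r = truncated_svd(L_B.T @ W @ L_A, r)
    left = np.linalg.solve(L_B.T, W_tilde_r)  # L_B^{-T} @ W_tilde_r
    return np.linalg.solve(L_A.T, left.T).T   # (...) @ L_A^{-1}

def weighted_error(W, W_hat, A, B):
    """Frobenius norm of L_B^T (W - W_hat) L_A, the error the sketch minimizes."""
    L_A, L_B = np.linalg.cholesky(A), np.linalg.cholesky(B)
    return np.linalg.norm(L_B.T @ (W - W_hat) @ L_A)

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2
W = rng.standard_normal((n, m))
# Hypothetical non-diagonal SPD sensitivity matrices.
X, Y = rng.standard_normal((m, m)), rng.standard_normal((n, n))
A = X @ X.T + m * np.eye(m)
B = Y @ Y.T + n * np.eye(n)

W_hat = weighted_low_rank(W, A, B, r)
W_svd = truncated_svd(W, r)
err_gfw = weighted_error(W, W_hat, A, B)
err_svd = weighted_error(W, W_svd, A, B)
print(f"weighted error, weighted truncation: {err_gfw:.4f}")
print(f"weighted error, plain SVD:           {err_svd:.4f}")
```

Under these assumptions the weighted truncation can never have a larger weighted error than plain truncated SVD at the same rank, and passing identity matrices for `A` and `B` recovers plain truncated SVD, which matches the standard-SVD case described for Figure 1.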

Figures (4)

  • Figure 1: Generalization of the Weighted SVD frameworks. For standard SVD, the transformation matrices are identity matrices. For FWSVD, the left matrix is diagonal but not identity, and the right matrix is identity. For GFWSVD, both matrices are non-diagonal.
  • Figure 2: Empirical runtime for computing the Kronecker decomposition of the Fisher matrix for weight matrices of varying sizes.
  • Figure 3: Macro-averaged GLUE performance of bert-base-uncased model for different compression ranks.
  • Figure 4: Average MMLU performance of llama-2-7b-chat model for different compression rates.

Theorems & Definitions (2)

  • Theorem 1
  • proof