KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models

Fan Wang, Juyong Jiang, Chansung Park, Sunghun Kim, Jing Tang

TL;DR

KaSA addresses the computational and memory cost of full fine-tuning (FFT) with a knowledge-aware, singular-value-based PEFT method. It first applies knowledge-based SVD truncation, which removes noisy knowledge from the base model to form W_{world}, and then learns task updates in SVD form ΔW = ΔU ΔΣ ΔV^T, where the diagonal entries of ΔΣ are knowledge-aware singular values that adapt to task relevance. Training is aided by singular-value regularization and orthogonal regularization. This design reduces interference from noisy or irrelevant knowledge and preserves a coherent representation space, yielding consistent improvements over FFT and 14 PEFT baselines across 16 benchmarks and 4 synthetic datasets spanning NLU, NLG, instruction following, and commonsense reasoning. Results include strong GLUE performance with limited trainable parameters and robust cross-model gains on synthetic instruction-following tasks. The approach enables efficient, scalable adaptation with negligible inference delay and supports multi-task updates using a single base model, which makes it practical for deploying large language models in domain-specific settings.
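The two stages summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`truncate_svd`, `svd_form_update`, `orthogonal_penalty`), the toy shapes, the zero initialization of ΔΣ, and the exact form of the orthogonality penalty are assumptions made for illustration only. See the linked repository for the actual code.

```python
import numpy as np

def truncate_svd(W, r):
    """Drop the r smallest singular components of W.

    Assumption: the components with the smallest singular values
    stand in for the "noisy" knowledge removed to form W_world.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = S.shape[0] - r  # number of components kept
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

def svd_form_update(dU, dS, dV):
    """Build dW = dU @ diag(dS) @ dV.T without materializing diag(dS)."""
    return (dU * dS) @ dV.T

def orthogonal_penalty(dU, dV):
    """Penalize deviation of dU and dV from having orthonormal columns.

    Assumption: a Frobenius-norm penalty; the paper's exact
    regularizer may differ.
    """
    I = np.eye(dU.shape[1])
    return np.linalg.norm(dU.T @ dU - I) + np.linalg.norm(dV.T @ dV - I)

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
n_out, n_in, r = 6, 4, 2
W = rng.standard_normal((n_out, n_in))

W_world = truncate_svd(W, r)                  # stage 1: truncated base
dU = 0.01 * rng.standard_normal((n_out, r))   # trainable in practice
dV = 0.01 * rng.standard_normal((n_in, r))    # trainable in practice
dS = np.zeros(r)                              # assumed zero init: no update at start

W_adapted = W_world + svd_form_update(dU, dS, dV)  # stage 2: adapted weights
```

Because ΔΣ is diagonal, scaling the columns of ΔU by the singular values is equivalent to multiplying by diag(ΔΣ). With ΔΣ set to zero, the adapted weights equal W_{world}, so in this sketch training starts from the truncated base.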

Abstract

The increasing sizes of large language models (LLMs) result in significant computational overhead and memory usage when adapting these models to specific tasks or domains. Various parameter-efficient fine-tuning (PEFT) methods have been devised to mitigate these challenges by training a small set of parameters for the task-specific updates of the model weights. Among PEFT methods, LoRA stands out for its simplicity and efficiency, inspiring the development of a series of variants. However, LoRA and its successors disregard the knowledge that is noisy or irrelevant to the targeted task, detrimentally impacting model performance and leading to suboptimality. To address this limitation, we introduce Knowledge-aware Singular-value Adaptation (KaSA), a PEFT method that leverages singular value decomposition (SVD) with knowledge-aware singular values to dynamically activate knowledge based on its relevance to the task at hand. We conduct extensive experiments across a range of LLMs on tasks spanning natural language understanding (NLU), generation (NLG), instruction following, and commonsense reasoning. The experimental results demonstrate that KaSA consistently outperforms FFT and 14 popular PEFT baselines across 16 benchmarks and 4 synthetic datasets, underscoring our method's efficacy and adaptability. The source code of our method is available at https://github.com/juyongjiang/KaSA.

Paper Structure

This paper contains 44 sections, 22 equations, 18 figures, 16 tables, 1 algorithm.

Figures (18)

  • Figure 1: The architecture of our proposed KaSA encompasses two stages: (Left) knowledge-based SVD truncation to remove the noisy knowledge from the base model; (Right) knowledge-aware singular-value adaptation to adjust singular values that dynamically activate knowledge across $\Delta \mathbf{W}$ model parameters based on its relevance to downstream tasks.
  • Figure 2: Components ablation study about knowledge-based SVD truncation, knowledge-aware singular value adaptation, singular value regularization $\mathcal{L}_{2}$, and orthogonal regularization $\mathcal{L}_{3}$ on MRPC, CoLA, and RTE datasets.
  • Figure 3: Budget parameter scalability of fine-tuning RoBERTa-base with LoRA, PiSSA, MiLoRA, and KaSA on MRPC, CoLA, and RTE datasets.
  • Figure 4: The final distribution of knowledge-aware singular values for $\mathbf{W}_q$ and $\mathbf{W}_v$ upon fine-tuning the RoBERTa-base model on the MNLI and QQP benchmarks. In this context, the $x$-axis corresponds to the layer index, and the $y$-axis denotes the position index. Each value signifies the relevance of the associated knowledge.
  • Figure 5: Prompt template of data synthesis for summarization tasks by GPT-4o.
  • ...and 13 more figures