TS-PEFT: Unveiling Token-Level Redundancy in Parameter-Efficient Fine-Tuning

Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang, Haojun Fei

TL;DR

This work reveals pervasive token-level redundancy in parameter-efficient fine-tuning and introduces TS-PEFT, a token-gating framework that uses proximal optimization to selectively update tokens during training. By learning per-layer thresholds, TS-PEFT can discard 40–60% of token updates without sacrificing accuracy, often surpassing dense baselines like LoRA and AdaLoRA. The approach also shows that token-level sparsity is a strong indicator of module importance, enabling more effective module selection under fixed parameter budgets and offering a path toward hardware-aware sparse fine-tuning. Overall, TS-PEFT provides a principled mechanism to identify and exploit redundancy in large models, with implications for both software design and future sparse hardware support.

Abstract

Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention and unveil a pervasive token-level redundancy in the fine-tuning of large models. We propose TS-PEFT, a theoretically grounded framework utilizing proximal optimization to dynamically identify and skip redundant token updates during training. Our extensive experiments across Natural Language Understanding, Commonsense Reasoning, and Visual Instruction Tuning demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Strikingly, by discarding 40%-60% of token updates, TS-PEFT consistently matches or surpasses the performance of dense baselines (e.g., LoRA, DoRA). Furthermore, we provide an in-depth analysis revealing that the learned token-level sparsity serves as a superior indicator of module importance compared to traditional weight norms, offering a novel data-driven perspective on the intrinsic adaptation mechanism of large models.
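The idea of skipping redundant token updates can be illustrated with a small sketch. It is not the authors' implementation: the gate used here (apply the low-rank update only when its norm is at least `tau` times the norm of the frozen output) is an assumed criterion, `gated_lora_forward` and all matrices are hypothetical, and the threshold is fixed by hand rather than learned through proximal optimization.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def l2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))

def gated_lora_forward(W, A, B, tokens, tau):
    """Apply a LoRA-style update B(Ax) only to tokens that pass the gate.

    Assumed gate (hypothetical, not taken from the paper): a token keeps
    its update when ||B A x|| >= tau * ||W x||.
    Returns the per-token outputs and the fraction of skipped updates.
    """
    outputs, skipped = [], 0
    for x in tokens:
        base = matvec(W, x)               # frozen pretrained path
        delta = matvec(B, matvec(A, x))   # low-rank adapter path
        if l2(delta) >= tau * l2(base):
            outputs.append([b + d for b, d in zip(base, delta)])
        else:
            outputs.append(base)          # update skipped for this token
            skipped += 1
    return outputs, skipped / len(tokens)

# Toy values chosen for illustration only.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity)
A = [[1.0, 0.0]]               # rank-1 down-projection
B = [[0.5], [0.0]]             # rank-1 up-projection
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.1, 3.0]]

outputs, skipped_fraction = gated_lora_forward(W, A, B, tokens, tau=0.2)
print(outputs)
print(skipped_fraction)
```

In this toy setting two of the four tokens fall below the gate and keep only the frozen output, so half of the token updates are skipped. In the paper the threshold is learned per layer; that learning step is omitted here.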

Paper Structure

This paper contains 20 sections, 15 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison between standard PEFT and our TS-PEFT framework.
  • Figure 2: Performance of LLaMA3.1-8B on CSR benchmarks as the percentage of selected modules varies for each PEFT method. Note the performance drops (e.g., DoRA at 60%) when redundant, high-sparsity modules are forced to update, indicating noise injection.
  • Figure 3: Training loss progression for layer base_model.layers.20.self_attn.v_proj of LLaMA3.1-8B.
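The module-selection experiment in Figure 2 can be pictured with a short sketch, assuming that modules are ranked by their learned token-level sparsity and that lower sparsity marks a more important module (consistent with the caption's description of high-sparsity modules as redundant). The function `select_modules`, the module names, and the sparsity values are placeholders made up for illustration, not numbers from the paper.

```python
import math

def select_modules(sparsity_by_module, keep_fraction):
    """Keep the least-sparse modules under a fixed budget.

    Assumption: a module that skips fewer token updates (lower sparsity)
    matters more for adaptation, so it is selected first.
    """
    ranked = sorted(sparsity_by_module, key=sparsity_by_module.get)
    k = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:k]

# Placeholder sparsity values (fraction of skipped token updates).
sparsity = {
    "q_proj": 0.35,
    "k_proj": 0.80,
    "v_proj": 0.20,
    "o_proj": 0.60,
}

selected = select_modules(sparsity, keep_fraction=0.5)
print(selected)
```

With these placeholder values and a 50% budget, the two least-sparse modules are kept and the two most-sparse ones are left frozen.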