GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Malyaban Bal, Abhronil Sengupta

TL;DR

GRASP introduces a lightweight parameter-efficient fine-tuning method that partitions transformer activations into K groups and learns shared per-group affine modulations, reducing trainable parameters to $O(nK)$ while preserving task performance. Building on this, StochGRASP models weight updates as Gaussian perturbations with a noise-aware objective to improve robustness under hardware-level variability, enabling more reliable edge deployment. Empirical results on GLUE (RoBERTa-base/large) and E2E NLG (GPT-2 Medium) show GRASP matches or surpasses established PEFT methods with far fewer trainable parameters, and StochGRASP consistently outperforms deterministic baselines under noise. The work highlights a multimodal, distributional view of PEFT parameters and demonstrates practical benefits for energy-efficient and robust transformer inference on non-ideal hardware.

Abstract

Parameter-efficient fine-tuning (PEFT) provides a scalable alternative to full-model adaptation by updating only a small subset of parameters in large pre-trained models. We introduce GRASP (GRouped Activation Shared Parameterization), a lightweight PEFT framework that partitions the $D$-dimensional token representations of selected layers into $K \ll D$ groups and learns a shared scaling and shifting vector for each group. This grouped modulation significantly reduces the number of trainable parameters while preserving the model's ability to learn task-specific features. Building on this formulation, we further propose StochGRASP, which learns Gaussian distributions as perturbations to the pre-trained weights rather than deterministic values. This probabilistic parameterization, together with a noise-aware loss function, makes it possible to model hardware-level variability in programmed weights and significantly improves robustness under non-ideal inference conditions, an important requirement for deployment on edge-based emerging AI hardware. Across GLUE (RoBERTa-base & RoBERTa-large) and E2E NLG (GPT-2 Medium), GRASP matches or exceeds the performance of established PEFT methods while achieving an order of magnitude reduction in trainable parameters compared to LoRA and BitFit. Under varying levels of noise, StochGRASP consistently outperforms deterministic variants, demonstrating its suitability for energy-efficient and noise-prone hardware platforms.
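
The grouped modulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `make_random_groups` and `grasp_modulate`, the equal-sized groups, the identity initialization (scale 1, shift 0), and the values D = 768 and K = 128 are all assumptions chosen for illustration. The sketch only shows how sharing one scale and one shift per group reduces the per-layer parameter count from 2D to 2K.

```python
import random


def make_random_groups(dim, num_groups, seed=0):
    """Randomly assign each of `dim` activation components to a group.

    Assumption: groups are equal-sized (dim divisible by num_groups).
    """
    rng = random.Random(seed)
    order = list(range(dim))
    rng.shuffle(order)
    group_of = [0] * dim
    for position, component in enumerate(order):
        group_of[component] = position % num_groups
    return group_of


def grasp_modulate(x, group_of, gamma, beta):
    """Scale and shift each component with its group's shared parameters."""
    return [gamma[g] * value + beta[g] for value, g in zip(x, group_of)]


# Hypothetical sizes: a 768-dimensional token representation, 128 groups.
D, K = 768, 128
group_of = make_random_groups(D, K)

# Assumed identity initialization: the pre-trained behavior is unchanged.
gamma = [1.0] * K
beta = [0.0] * K

x = [0.01 * i for i in range(D)]
y = grasp_modulate(x, group_of, gamma, beta)

trainable_grouped = 2 * K      # one scale and one shift per group
trainable_independent = 2 * D  # one scale and one shift per component
print(trainable_grouped, trainable_independent)  # 256 1536
```

During fine-tuning only `gamma` and `beta` would be updated while the pre-trained weights stay frozen; the random assignment in `group_of` is fixed once and reused.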

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: High-level overview of the proposed PEFT framework. (1) GRASP performs parameter-efficient fine-tuning by learning grouped scaling and shifting parameters over activations, enabling low-memory training with deterministic modulation. (2) StochGRASP extends this concept by learning Gaussian perturbation distributions over weight updates rather than fixed values, coupled with a noise-aware objective that yields significantly more robust inference under hardware-level noise than deterministic baselines.
  • Figure 2: (a) Random grouping of the per-token $D$-dimensional input activation of a layer into $K$ groups. GRASP learns scaling and shifting parameters per group rather than independently for each component. (b & c) Kernel Density Estimation (KDE) plots showing (b) the distribution of the shifting parameter ($\beta$) and (c) the distribution of the scaling parameter ($\gamma$) learned in a selected projection layer (Key) when parameters are learned independently versus using GRASP with $K=128$. Results are from the GLUE CoLA dataset.
  • Figure 3: (a) KDE plots of the scaling parameter ($\gamma$) distribution in the same projection layer (Key) for GRASP with varying numbers of groups $K \in \{8, 32, 128\}$. Smaller $K$ values produce distinct parameter clusters (modes), enabling the model to efficiently learn the downstream task (GLUE SST-2) with significantly fewer trainable parameters. (b) Trade-off between accuracy and trainable parameters ($\%$) on the SST-2 task using GRASP with different values of $K$.
  • Figure 4: (a) KDE plots of scaling parameter distribution of the same projection layer (Key) for GRASP on CoLA and SST-2.
  • Figure 5: (a) Gaussian distributions (perturbations) learned without the modified objective, and (b) distributions learned with the proposed objective (Eqn. \ref{eqn6}); both learn $K = 16$ distributions per layer.
  • ...and 1 more figure
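
The StochGRASP idea summarized in the captions of Figures 1 and 5, a small number of learned Gaussian perturbation distributions per layer, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the helper name `sample_perturbed_weights`, the additive form of the perturbation, the log-standard-deviation parameterization, and the mapping of weights to groups are hypothetical. The noise-aware objective is not reproduced here, since its form is not given in this overview.

```python
import math
import random


def sample_perturbed_weights(weights, group_of, mu, log_sigma, rng):
    """Draw one noisy realization of a layer's weights.

    Assumption: each weight w in group g becomes
        w + mu[g] + sigma[g] * eps,  with eps ~ N(0, 1),
    where sigma[g] = exp(log_sigma[g]) keeps the standard deviation positive.
    """
    sampled = []
    for w, g in zip(weights, group_of):
        eps = rng.gauss(0.0, 1.0)
        sampled.append(w + mu[g] + math.exp(log_sigma[g]) * eps)
    return sampled


# Hypothetical toy layer: 8 frozen weights shared across K = 2 distributions.
frozen = [0.5, -0.2, 0.1, 0.0, 0.3, -0.4, 0.2, 0.6]
group_of = [0, 1, 0, 1, 0, 1, 0, 1]
mu = [0.05, -0.05]                             # learned means (illustrative)
log_sigma = [math.log(0.1), math.log(0.02)]    # learned log std (illustrative)

rng = random.Random(0)
noisy = sample_perturbed_weights(frozen, group_of, mu, log_sigma, rng)
print(len(noisy))  # 8
```

Writing the sample as mean plus scaled standard normal noise is the usual reparameterization that lets gradients reach `mu` and `log_sigma`; whether the authors use this exact form is not stated in this overview.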