Propulsion: Steering LLM with Tiny Fine-Tuning

Md Kowsher, Nusrat Jahan Prottasha, Prakash Bhat

TL;DR

Propulsion is a parameter-efficient fine-tuning method that freezes pre-trained weights and learns a per-layer diagonal Propulsion matrix to re-scale layer outputs, steering model behavior with a small fraction of the trainable parameters. Grounded in Neural Tangent Kernel (NTK) theory, the paper shows that updating a diagonal subset of parameters can closely approximate full fine-tuning, with formal NTK bounds and empirical validation. Across GLUE, SQuAD, summarization, and multiple large language models, Propulsion delivers competitive or superior performance while sharply reducing parameter count and training resources compared to established PEFT methods. The approach converges faster and uses less memory, offering a practical, scalable path for task-specific adaptation of large transformers, though it offers limited granularity of control and depends on the quality of the pre-trained model.
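The re-scaling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy weights, the all-ones initialization of z, and the "tuned" values are all assumptions chosen only to show how an element-wise vector z (equivalently, a diagonal matrix) can change the output of a frozen layer.

```python
# Minimal sketch of output re-scaling on top of a frozen layer.
# All names and values are illustrative assumptions, not the paper's code.

def frozen_linear(x, W):
    """Compute x @ W for a vector x (length d_in) and a matrix W (d_in x d_out)."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def propulsion_forward(x, W, z):
    """Scale output dimension j of the frozen layer by z[j].

    This equals multiplying the layer output by the diagonal matrix diag(z).
    """
    h = frozen_linear(x, W)
    return [h_j * z_j for h_j, z_j in zip(h, z)]

# Frozen pre-trained weights: never updated in this sketch.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
x = [1.0, 2.0]

z_init = [1.0, 1.0, 1.0]   # assumed init: all ones leaves the output unchanged
z_tuned = [1.0, 0.5, 2.0]  # hypothetical values after training

base = frozen_linear(x, W)
same = propulsion_forward(x, W, z_init)
steered = propulsion_forward(x, W, z_tuned)

full_params = len(W) * len(W[0])  # entries a full update of W would train
propulsion_params = len(z_init)   # entries trained in this sketch
```

In this toy layer only the three entries of z would be trained, against six entries for a full update of W. The gap grows with layer width, since a d x d weight matrix has d squared entries while the scaling vector has d.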

Abstract

The rapid advancements in Large Language Models (LLMs) have revolutionized natural language processing (NLP) and related fields. However, fine-tuning these models for specific tasks remains computationally expensive and risks degrading pre-learned features. To address these challenges, we propose Propulsion, a novel parameter-efficient fine-tuning (PEFT) method designed to optimize task-specific performance while drastically reducing computational overhead. Inspired by the concept of controlled adjustments in physical motion, Propulsion selectively re-scales specific dimensions of a pre-trained model, guiding output predictions toward task objectives without modifying the model's parameters. By introducing lightweight, trainable Propulsion parameters at the pre-trained layer, we minimize the number of parameters updated during fine-tuning, preventing overfitting or overwriting of existing knowledge. Our theoretical analysis, supported by Neural Tangent Kernel (NTK) theory, shows that Propulsion approximates the performance of full fine-tuning with far fewer trainable parameters. Empirically, Propulsion reduces the parameter count from 355.3 million to just 0.086 million, achieving over a 10x reduction compared to standard approaches like LoRA while maintaining competitive performance across benchmarks.
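The two parameter counts quoted in the abstract can be turned into a reduction factor with a short calculation. The sketch below uses only the 355.3 million and 0.086 million figures from the text; the variable names are illustrative. The separate "over 10x" comparison against LoRA is not computed, because LoRA's parameter count is not given here.

```python
# Arithmetic on the two counts stated in the abstract (both in millions).
full_params_m = 355.3        # parameter count before Propulsion
propulsion_params_m = 0.086  # trainable parameters with Propulsion

# How many times smaller the trainable set is.
reduction = full_params_m / propulsion_params_m

# Trainable parameters as a percentage of the original count.
trainable_pct = 100.0 * propulsion_params_m / full_params_m

print(f"reduction factor: {reduction:.0f}x")
print(f"trainable share:  {trainable_pct:.3f}%")
```

On these numbers the trainable set is roughly four thousand times smaller than the original count, so the "10x" figure in the abstract is best read as the comparison to LoRA, not to the full model.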

Paper Structure

This paper contains 28 sections, 1 theorem, 34 equations, 11 figures, 24 tables, 1 algorithm.

Key Result

Theorem 1

Let $\phi_P(\mathbf{x}; \theta_t)$ be the output of the Propulsion model at time step $t$, where the base matrix $\theta_0$ is pre-trained and fixed, and the Propulsion matrix $\mathbf{z}_t$ is updated during training. Let $\phi_F(\mathbf{x}; \theta_t)$ be the output of the fully fine-tuned model at [...] Furthermore, the error between the NTK for Propulsion and the NTK for full fine-tuning can be bounded [...]
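To make the two kernels in the theorem concrete, the following sketch computes an empirical NTK, defined in the standard way as the inner product of parameter gradients at two inputs, for a toy model. Everything here is an assumption for illustration: the scalar model f(x) = sum_j v[j] * (x @ W)[j] * z[j], the fixed readout v, and the toy values. It shows how a kernel restricted to z and a kernel over all entries of W are computed; it does not reproduce the paper's bound.

```python
# Toy empirical NTK: K(x1, x2) = <grad f(x1), grad f(x2)>.
# Model (assumed): f(x) = sum_j v[j] * (x @ W)[j] * z[j]

def hidden(x, W):
    """Compute x @ W for a vector x and a matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def grad_wrt_z(x, W, z, v):
    """Gradient over the scaling vector only: df/dz_j = v_j * (x @ W)_j."""
    h = hidden(x, W)
    return [v[j] * h[j] for j in range(len(z))]

def grad_wrt_W(x, W, z, v):
    """Gradient over every weight: df/dW_ij = x_i * v_j * z_j (row-major)."""
    return [x[i] * v[j] * z[j] for i in range(len(x)) for j in range(len(z))]

def ntk(g1, g2):
    """Inner product of two flattened gradients."""
    return sum(a * b for a, b in zip(g1, g2))

W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
z = [1.0, 1.0, 1.0]  # scaling vector at an assumed all-ones start
v = [1.0, 1.0, 1.0]  # hypothetical fixed readout
x1, x2 = [1.0, 2.0], [2.0, 0.0]

# Kernel when only z is trainable (the Propulsion-style setting).
k_p = ntk(grad_wrt_z(x1, W, z, v), grad_wrt_z(x2, W, z, v))
# Kernel when every entry of W is trainable (the full fine-tuning setting).
k_f = ntk(grad_wrt_W(x1, W, z, v), grad_wrt_W(x2, W, z, v))
```

The two kernels generally differ for an arbitrary toy model like this one; the theorem concerns how large that difference can be under the paper's assumptions, which this sketch does not attempt to verify.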

Figures (11)

  • Figure 1: A detailed illustration of the model architectures for five different adapters: (a) LoRA, (b) AdaLoRA, (c) Prefix & Prompt Tuning, and (d) Propulsion. In the diagrams, W represents the pre-trained weight matrix, which is kept frozen, while X denotes the input. The matrices A, B, and E are trainable and of lower rank. The variable z indicates the Propulsion parameter.
  • Figure 2: Propulsion in Transformer Block. Within the figure, the red cells represent trainable parameters while the blue cells represent the frozen parameters. The Propulsion layers above show where our method executes during model fine-tuning. All layers use the same Propulsion matrix, but are modified by their corresponding vector $z_{i}$.
  • Figure 3: Comparative Analysis of PEFT Methods on the SST-2 Dataset. On the right-side graph, we shortened the following method names: AdaLoRA to AdaL., Prompt Tuning to Prom., Propulsion to Propul, and Prefix-Tuning to Pref. In this graph, purple represents the percentage of parameters after applying these methods, the cyan represents the total training time in hours, and the green represents the iteration time in seconds.
  • Figure 4: Memory Cost Comparison of PEFT Methods. The blue bars represent the memory cost of the original model weights, whereas the green bars represent the optimization memory cost for each of these methods.
  • Figure 5: Left: performance vs. degree for SST-2, QNLI, and MRPC. Right: training steps vs. accuracy for SST-2.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Theorem 1