Enhancing Learned Knowledge in LoRA Adapters Through Efficient Contrastive Decoding on Ascend NPUs

Morgan Lindsay Heisler, Linzi Xing, Ge Shi, Hanieh Sadri, Gursimran Singh, Weiwei Zhang, Tao Ye, Ying Xiong, Yong Zhang, Zhenan Fan

TL;DR

LoRA fine-tuning is efficient, but under standard decoding the base model's biases can limit reasoning in LoRA-adapted models. CoLD reframes decoding as a contrast between the LoRA-adapted expert and the base model acting as the amateur, using an $\alpha$-masked expert and a $\beta$-penalty on the base model to prioritize task-specific knowledge. An optimized Batched Gather Matrix-Vector (BGMV) kernel for Ascend NPUs enables multi-LoRA inference with memory savings and reduced latency. The method achieves up to a 5.54% accuracy gain on GSM8K and a 28% end-to-end latency reduction compared to greedy decoding. The approach is integrated into Ascend-vLLM/Hugging Face pipelines, demonstrating practical, hardware-conscious decoding improvements for production deployments in cloud and on-prem environments.

Abstract

Huawei Cloud users leverage LoRA (Low-Rank Adaptation) as an efficient and scalable method to fine-tune and customize large language models (LLMs) for application-specific needs. However, tasks that require complex reasoning or deep contextual understanding are often hindered by biases or interference from the base model when using typical decoding methods like greedy or beam search. These biases can lead to generic or task-agnostic responses from the base model instead of leveraging the LoRA-specific adaptations. In this paper, we introduce Contrastive LoRA Decoding (CoLD), a novel decoding framework designed to maximize the use of task-specific knowledge in LoRA-adapted models, resulting in better downstream performance. CoLD uses contrastive decoding by scoring candidate tokens based on the divergence between the probability distributions of a LoRA-adapted expert model and the corresponding base model. This approach prioritizes tokens that better align with the LoRA's learned representations, enhancing performance for specialized tasks. While effective, a naive implementation of CoLD is computationally expensive because each decoding step requires evaluating multiple token candidates across both models. To address this, we developed an optimized kernel for Huawei's Ascend NPU. CoLD achieves up to a 5.54% increase in task accuracy while reducing end-to-end latency by 28% compared to greedy decoding. This work provides practical and efficient decoding strategies for fine-tuned LLMs in resource-constrained environments and has broad implications for applied data science in both cloud and on-premises settings.
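
The token-scoring idea described above can be illustrated with a minimal sketch. It assumes the standard contrastive decoding form: a plausibility mask that keeps only tokens whose expert probability is at least $\alpha$ times the expert's top probability, followed by a score of expert log-probability minus $\beta$ times base log-probability. The exact scoring function used by CoLD is not given in this summary, so the formula, the function names (`cold_scores`, `pick_token`), and the default values of `alpha` and `beta` are assumptions for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cold_scores(expert_logits, base_logits, alpha=0.1, beta=0.5):
    """Contrastive scores for one decoding step (illustrative form, not the paper's exact one).

    expert_logits: next-token logits from the base model plus LoRA adapter.
    base_logits:   next-token logits from the base model alone (the amateur).
    """
    expert_lp = log_softmax(expert_logits)
    base_lp = log_softmax(base_logits)
    # Alpha mask: drop tokens the expert itself finds implausible.
    cutoff = math.log(alpha) + max(expert_lp)
    return [
        e - beta * b if e >= cutoff else -math.inf
        for e, b in zip(expert_lp, base_lp)
    ]


def pick_token(expert_logits, base_logits, alpha=0.1, beta=0.5):
    """Return the index of the highest-scoring token."""
    scores = cold_scores(expert_logits, base_logits, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy three-token vocabulary. The expert slightly prefers token 0, but the
# base model already favors token 0 strongly, so the contrast shifts the
# choice to token 1. Token 2 is removed by the alpha mask.
expert = [2.0, 1.8, -3.0]
base = [2.5, 0.5, 3.0]
greedy_choice = max(range(len(expert)), key=expert.__getitem__)
cold_choice = pick_token(expert, base)
```

In this toy case greedy decoding on the expert would select token 0, while the contrastive score selects token 1, the token on which the adapter and the base model disagree most among those the expert considers plausible.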

Paper Structure

This paper contains 31 sections, 2 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Contrastive LoRA Decoding Framework on Huawei Cloud Service: In practice, contrastive decoding (a) operates through the contrastive interaction between a small amateur model and a large expert model, typically selected from the same model family. Our proposed CoLD (b) formulates the LoRA adapter, combined with the base language model, as the expert model, while the base language model alone serves as the amateur model.
  • Figure 2: Time per output token (TPOT) comparison between Gather-BMM and BGMV on Ascend NPU. Gather-BMM runs out of memory at rank 64 for batch sizes larger than 16.
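
The memory gap reported in Figure 2 can be illustrated with a small pure-Python sketch of the two strategies as they are commonly described for multi-LoRA serving. This is a conceptual assumption, not the Ascend kernel itself: Gather-BMM first materializes one copy of the selected adapter weights per request and then runs a batched multiply, while BGMV reads each request's adapter in place through its index. The function names and the nested-list weight layout are hypothetical.

```python
import copy


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gather_bmm(weights, indices, xs):
    """Gather then batched multiply.

    Step 1 copies one adapter matrix per request, so the extra memory
    grows with batch size times adapter size.
    """
    gathered = [copy.deepcopy(weights[i]) for i in indices]
    return [matvec(W, x) for W, x in zip(gathered, xs)]


def bgmv(weights, indices, xs):
    """Batched gather matrix-vector multiply.

    Each request indexes into the shared adapter bank directly, so no
    per-request weight copy is created.
    """
    return [matvec(weights[i], x) for i, x in zip(indices, xs)]


# Two toy adapters (2x2 matrices) shared by a batch of three requests.
adapter_bank = [
    [[1, 0], [0, 1]],  # adapter 0: identity
    [[2, 0], [0, 2]],  # adapter 1: scale by 2
]
request_adapters = [1, 0, 1]
inputs = [[1, 2], [3, 4], [5, 6]]

out_gather = gather_bmm(adapter_bank, request_adapters, inputs)
out_bgmv = bgmv(adapter_bank, request_adapters, inputs)
```

Both functions return the same result. The difference is the intermediate `gathered` list, which stands in for the per-request weight tensor that grows with batch size and adapter rank.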