LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

TL;DR

LLaMA-Excitor presents a novel parameter-efficient fine-tuning approach that indirectly enhances instruction-following by inserting Excitor blocks into the top attention layers to modify similarity scores via learnable prompts, preserving the pre-trained reasoning distribution. The method supports a unified multi-modal extension using a frozen visual encoder, avoiding heavy alignment modules and enabling vision-language capabilities with minimal overhead. Empirically, it achieves significant gains on MMLU (+6% relative in some settings), strong image-captioning performance on MSCOCO (157.5 CIDEr), and competitive ScienceQA results, while mitigating forgetting compared to full fine-tuning. The approach demonstrates that indirect feature interaction can unlock LLM potential without overhauling the model’s foundational capabilities, offering practical benefits for both language-only and vision-language applications.

Abstract

Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, which introduce extra modules or additional input sequences to inject new skills or knowledge, may compromise the innate abilities of LLMs. In this paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information. Specifically, LLaMA-Excitor does not directly change the intermediate hidden state during the self-attention calculation of the transformer structure. We design the Excitor block as a bypass module for the similarity score computation in LLMs' self-attention to reconstruct keys and change the importance of values by learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets. Furthermore, we unify the modeling of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment. Our proposed approach is evaluated in language-only and multi-modal tuning experimental scenarios. Notably, LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement (+6%) on the MMLU benchmark. In visual instruction tuning, we achieve a new state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO, and performance on ScienceQA (88.39%) comparable to cutting-edge models with more parameters and extensive vision-language pre-training.
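The bypass idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the way the extra key is built here (each original key is replaced by a softmax-weighted mix of the prompts), the single scalar `gate`, and all names and shapes are assumptions chosen for illustration. What the sketch does show is the property the abstract describes: learnable prompts only alter the similarity scores, so the output is still a weighted average of the original values, and a zero-initialised gate leaves the frozen model's attention unchanged at the start of training.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list; -inf entries get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def excitor_attention(q, k, v, prompts, gate):
    """Causal single-head attention with a gated, prompt-derived score bypass.

    q, k, v : lists of T vectors of length d, produced by the frozen model.
    prompts : list of P learnable vectors of length d (hypothetical shape).
    gate    : learnable scalar, initialised to 0 for a cold start.
    Returns (outputs, attention_weights).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)

    # Assumed key reconstruction: each original key attends over the prompts
    # and is replaced by the resulting mixture of prompt vectors.
    extra_k = []
    for key in k:
        mix = softmax([dot(key, p) * scale for p in prompts])
        extra_k.append([sum(w * p[c] for w, p in zip(mix, prompts)) for c in range(d)])

    outputs, weights = [], []
    for i, query in enumerate(q):
        row = []
        for j in range(len(k)):
            if j > i:
                row.append(-math.inf)  # causal mask: no attention to future tokens
            else:
                original = dot(query, k[j]) * scale
                extra = dot(query, extra_k[j]) * scale
                row.append(original + gate * extra)  # merge the two score sets
        w = softmax(row)
        weights.append(w)
        # The values themselves are untouched; only their weights change.
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(d)])
    return outputs, weights


def random_vectors(n, d):
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
T, P, d = 4, 2, 6  # toy sizes, not taken from the paper
q, k, v = random_vectors(T, d), random_vectors(T, d), random_vectors(T, d)
prompts = random_vectors(P, d)

out_cold, w_cold = excitor_attention(q, k, v, prompts, gate=0.0)  # identical to plain attention
out_warm, w_warm = excitor_attention(q, k, v, prompts, gate=0.5)  # prompts re-weight the values
```

With `gate=0.0` the merged scores equal the original ones, so the attention weights match ordinary causal attention. A non-zero gate shifts the weights, but each row still sums to one over the original values.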

Paper Structure (22 sections, 9 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Overview of LLaMA-Excitor. We integrate Excitor blocks into L out of N attention layers of LLaMA. Differing from previous PEFT techniques, LLaMA-Excitor indirectly involves learnable information in the reasoning process by changing the similarity matrices. It ensures that the hidden states are within the original distribution of LLaMA.
  • Figure 2: Details of the Excitor block. We assign a set of learnable prompts for attention layers of LLaMA. These prompts are used to construct an extra $Key$ for computing additional similarity scores, which are then merged into the original scores to alter the LLM's behavior. Cold-start gating factors are designed to stabilize the training.
  • Figure 3: Extending LLaMA-Excitor into a powerful multi-modal model. Owing to the indirect feature interaction, LLaMA-Excitor is the cheapest PEFT method that can follow visual instructions without complicated projection modules aligning vision and language.
  • Figure 4: Quantitative comparisons between LLaMA-Excitor (blue) and other methods, evaluated by GPT-4 (OpenAI, 2023).
  • Figure 5: The relative average performance changes and win-loss situations of fine-tunings compared to original LLaMA-7B on MMLU.
  • ...and 2 more figures
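
The Figure 5 caption refers to relative average performance changes and win-loss situations against the original LLaMA-7B on MMLU. The sketch below shows one plausible way to compute such quantities from per-subject accuracies. The exact definitions used in the paper are not given in this section, so the formulas, the function names, and the numbers are all assumptions. The accuracies are made up for illustration and are not results from the paper.

```python
def relative_change(baseline, tuned):
    """Relative change of the mean score, as a percentage of the baseline mean."""
    base_mean = sum(baseline.values()) / len(baseline)
    tuned_mean = sum(tuned[s] for s in baseline) / len(baseline)
    return 100.0 * (tuned_mean - base_mean) / base_mean


def win_loss(baseline, tuned):
    """Count subjects where the tuned model is better, worse, or equal."""
    wins = sum(1 for s in baseline if tuned[s] > baseline[s])
    losses = sum(1 for s in baseline if tuned[s] < baseline[s])
    ties = len(baseline) - wins - losses
    return wins, losses, ties


# Hypothetical per-subject accuracies (percent); NOT taken from the paper.
baseline = {"subject_a": 40.0, "subject_b": 30.0, "subject_c": 50.0}
tuned = {"subject_a": 44.0, "subject_b": 30.0, "subject_c": 49.0}

change = relative_change(baseline, tuned)  # mean goes from 40.0 to 41.0
record = win_loss(baseline, tuned)
```

Reporting the win-loss record alongside the average is useful because an average gain can hide regressions on individual subjects, which is the kind of forgetting the paper is concerned with.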