PILLOW: Enhancing Efficient Instruction Fine-tuning via Prompt Matching

Zhenting Qi; Xiaoyu Tan; Shaojie Shi; Chao Qu; Yinghui Xu; Yuan Qi

TL;DR

PiLLow tackles the high cost of supervised fine-tuning (SFT) for large language models by combining LoRA with a discrimination-based prompt matching mechanism trained with reinforcement learning (RL). A matching network learns to select exemplars from a user-defined prompt pool and prepend them to the user instruction, enabling strong instruction following on consumer-grade GPUs. The approach reaches performance comparable to typical SFT on standard instruction datasets (Alpaca and Dolly) across multiple model sizes, with larger models deriving greater gains, at a much lower computational cost. This work offers a practical path to accessible, high-quality instruction tuning in low-resource settings and opens avenues for further RL-based prompting research.

Abstract

Instruction fine-tuning has conventionally been employed to adapt Large Language Models (LLMs) to a variety of tasks. Nonetheless, this technique often necessitates substantial computational resources, making it impractical for deployment by individuals or small-scale entities. Recently, Low-Rank Adaptation (LoRA) has become a promising alternative, offering high capabilities on par with full tuning with reduced resource overhead. However, attaining satisfactory performance through the fine-tuning of LoRA is a non-trivial challenge. In this paper, we propose PILLOW, which aims to improve LoRA's performance by a discrimination-based prompting method, leveraging LLMs' In-Context Learning ability. PILLOW incorporates a matching network that selects prompts from a user-defined prompt pool, concatenates the selected prompts with the user instruction as input, and performs inference using the LoRA-fine-tuned LLMs. Trained with Reinforcement Learning, PILLOW exhibits commensurate performance on various evaluation metrics compared with typical instruction fine-tuning methods, utilizing only consumer-grade GPU resources and exhibiting a large reduction in computational costs.

Paper Structure (21 sections, 3 equations, 2 figures, 3 tables)

Introduction
Method
Preliminary
Supervised Fine-tuning and LoRA
In-Context Learning
PiLLow
Motivation
RL-based Prompt Matching
Experiments
Datasets
Reward Function
Experiment Setup
Evaluation
Results
Ablation Studies
...and 6 more sections

Figures (2)

Figure 1: A demonstration of 2-shot PiLLow.
Figure 2: Illustration of PiLLow. The left figure shows how the matching net is trained: At each step (out of $m$ steps), one prompt is selected from the prompt set by the matching net according to the user query and current matched prompts. After prompts are collected, they are passed to the LLM to get the answer, from which we calculate a reward. The right one shows the detailed pipeline of the matching network: The left MLP transforms the prompts into a set of vectors, with which we calculate dot products with the vector transformed by the right MLP from the state representation, and we obtain a probability distribution over the prompts.
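
The matching step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embedding sizes, the one-hidden-layer MLPs, the helper names (`init_mlp`, `mlp`, `match_prompts`), and the state representation (the query embedding concatenated with the mean embedding of the prompts matched so far) are all assumptions made for the example. Whether a prompt may be selected more than once is also not stated in the caption, so the sketch does not mask previous picks. RL training from the LLM-derived reward is omitted.

```python
import numpy as np


def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for a one-hidden-layer MLP (sizes are illustrative)."""
    return {
        "W1": rng.normal(0.0, 0.1, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.1, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def mlp(params, x):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def match_prompts(rng, prompt_emb, query_emb, prompt_mlp, state_mlp, m):
    """Sample m prompt indices, one per step, from dot-product scores."""
    # "Left MLP": transform every prompt in the pool into a vector.
    keys = mlp(prompt_mlp, prompt_emb)              # (n_prompts, d)
    chosen, dists = [], []
    for _ in range(m):
        # Assumed state: query embedding + mean embedding of matched prompts.
        if chosen:
            context = prompt_emb[chosen].mean(axis=0)
        else:
            context = np.zeros(prompt_emb.shape[1])
        state = np.concatenate([query_emb, context])
        # "Right MLP": transform the state, then score each prompt by dot product.
        query_vec = mlp(state_mlp, state)           # (d,)
        probs = softmax(keys @ query_vec)           # distribution over the pool
        chosen.append(int(rng.choice(len(probs), p=probs)))
        dists.append(probs)
    return chosen, dists


# Example: a pool of 5 prompts with 8-dimensional embeddings, 2-shot matching.
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=(5, 8))
query_emb = rng.normal(size=8)
prompt_mlp = init_mlp(rng, 8, 16, 4)
state_mlp = init_mlp(rng, 16, 16, 4)   # state = query (8) + context (8)
chosen, dists = match_prompts(rng, prompt_emb, query_emb, prompt_mlp, state_mlp, m=2)
```

In a full pipeline, the prompts at the `chosen` indices would be concatenated with the user instruction and passed to the LoRA-fine-tuned LLM, whose answer yields the reward used to train the two MLPs.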
