One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Yutao Zhu; Zhaoheng Huang; Zhicheng Dou; Ji-Rong Wen

TL;DR

This work introduces SPRING, a lightweight, parameter-efficient method for retrieval-augmented generation that learns scalable and pluggable virtual token embeddings while freezing the backbone LLM. By inserting trainable tokens between retrieved results and the user query, SPRING enhances the model's ability to leverage external knowledge without compromising non-RAG capabilities. The approach demonstrates strong improvements across 12 QA datasets and shows robustness to different retrievers, passage counts, and cross-dataset training regimes. Its plug-and-play design and minimal additional parameters make SPRING practical for deploying RAG-enabled LLMs in real-world settings while maintaining core generation quality.
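
The placement described above (trainable tokens between the retrieved results and the user query, removable for non-RAG use) can be sketched as a small input-assembly helper. This is a minimal illustration, not the authors' implementation: the token strings `<v1>` to `<v8>`, the count of eight, the newline separator, and the function name `build_input` are all assumptions made for the example.

```python
from typing import List

# Hypothetical placeholder strings for the learned virtual tokens.
# The actual token names and their number are not specified in this summary.
VIRTUAL_TOKENS: List[str] = [f"<v{i}>" for i in range(1, 9)]


def build_input(question: str, passages: List[str], k: int = 0) -> str:
    """Assemble a model input with the first k virtual tokens placed
    between the retrieved passages and the user question."""
    if k < 0 or k > len(VIRTUAL_TOKENS):
        raise ValueError("k must be between 0 and the number of trained tokens")
    parts: List[str] = list(passages)
    if passages and k > 0:
        # RAG use: plug in the first k virtual tokens after the passages.
        parts.append("".join(VIRTUAL_TOKENS[:k]))
    # Non-RAG use: no passages, so no virtual tokens are added and the
    # question reaches the model exactly as the user wrote it.
    parts.append(question)
    return "\n".join(parts)
```

Calling `build_input("Who wrote Hamlet?", [])` returns the question unchanged, which is the sense in which the tokens are pluggable: without retrieval, the frozen model sees an ordinary prompt.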

Abstract

Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capabilities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across 12 question-answering tasks demonstrate the superiority of our approach.
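
To make "maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens" concrete, the toy update step below changes only the embedding rows of the virtual tokens and leaves every other row untouched. It is a sketch under stated assumptions: the dictionary-based embedding table, the plain SGD rule, and the names `sgd_step` and `<v1>` are hypothetical and stand in for whatever optimizer and framework the authors actually use.

```python
from typing import Dict, List, Set


def sgd_step(
    embeddings: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable: Set[str],
    lr: float,
) -> None:
    """Apply one SGD update in place, but only to trainable tokens."""
    for token, grad in grads.items():
        if token not in trainable:
            # Frozen backbone vocabulary: the gradient is ignored.
            continue
        embeddings[token] = [w - lr * g for w, g in zip(embeddings[token], grad)]


# Toy embedding table: one ordinary vocabulary token, one virtual token.
table = {"the": [1.0, 2.0], "<v1>": [1.0, 1.0]}
toy_grads = {"the": [0.5, 0.5], "<v1>": [0.5, -0.5]}

sgd_step(table, toy_grads, trainable={"<v1>"}, lr=0.5)
```

After the call, the row for `"the"` is still `[1.0, 2.0]`, while the row for `"<v1>"` has moved. Because nothing outside the virtual-token rows changes, removing those tokens from the input restores the original model behavior.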

Paper Structure (31 sections, 3 equations, 10 figures, 12 tables)

Introduction
Related Work
Retrieval-Augmented Generation
Parameter-Efficient Fine-Tuning
Methodology
Problem Formulation
Scalable and Pluggable Virtual Tokens for RAG
Scalable
Pluggable
Inference
Experiment
Datasets and Retrievers
Baseline Methods
Implementation Details
Experimental Results
...and 16 more sections

Figures (10)

Figure 1: Illustration of existing methods for RAG and our proposed method. Our method can improve LLMs' performance in RAG scenarios by incorporating trainable virtual tokens, and these tokens can be removed to preserve the general generation abilities in non-RAG scenarios.
Figure 2: Illustration of SPRING. Only the embeddings of the $n$ added tokens are trainable during fine-tuning. The added tokens are scalable: the first $k$ tokens, for any $k \leq n$, can be used at inference.
Figure 3: Average performance on nine QA datasets with various numbers of virtual tokens.
Figure 4: Average performance on nine QA datasets with different numbers of retrieved passages.
Figure 5: An example code snippet showing how to use SPRING in practice.
...and 5 more figures
