Efficient Knowledge Feeding to Language Models: A Novel Integrated Encoder-Decoder Architecture
S Santosh Kumar, Rishi Gottimukkala, Supriya Devidutta, Karthikeyan S
TL;DR
This work tackles the limitations of static knowledge in large language models and the token-length constraints of retrieval-augmented generation by proposing an integrated encoder-decoder that uses In-Context Vectors (ICVs) to embed task information directly into the model's latent space. The architecture fuses retrieval with generation through a cross-attention mechanism, in which ICVs modulate latent states to distill retrieved knowledge without adding demonstrations to the prompt. Training combines a generation loss $\mathcal{L}_{\text{gen}}$ and a cosine loss $\mathcal{L}_{\text{cos}}$, with a dynamic weight $\alpha$ that shifts emphasis from retrieval to generation as representations align. Empirical results on Natural Questions, TriviaQA, and HotpotQA show competitive generation performance and superior retrieval accuracy, with ~140M-parameter models approaching the performance of substantially larger systems such as LLaMA-3, Gemma, and Phi-3 while requiring less compute and memory. The approach therefore offers a robust, scalable way to integrate knowledge into LLMs efficiently, addressing both token limits and retrieval accuracy in diverse, data-rich tasks.
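The interplay between $\mathcal{L}_{\text{gen}}$, $\mathcal{L}_{\text{cos}}$, and $\alpha$ can be illustrated with a minimal sketch. The summary does not give the exact formula, so the convex combination and the rule tying $\alpha$ to the current cosine loss are assumptions made for illustration, and the names `combined_loss`, `query_vec`, and `doc_vec` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combined_loss(gen_loss, query_vec, doc_vec):
    """Blend L_gen and L_cos with an alignment-dependent weight alpha.

    Assumed form (not taken from the paper):
        L = alpha * L_cos + (1 - alpha) * L_gen
    """
    # L_cos is 0 when the two representations point in the same direction.
    cos_loss = 1.0 - cosine_similarity(query_vec, doc_vec)
    # Assumed schedule: alpha stays in [0, 1] and shrinks as alignment improves,
    # so the emphasis moves from retrieval (L_cos) to generation (L_gen).
    alpha = min(1.0, max(0.0, cos_loss))
    total = alpha * cos_loss + (1.0 - alpha) * gen_loss
    return total, alpha


# Aligned representations: all of the weight goes to the generation loss.
aligned_total, aligned_alpha = combined_loss(2.0, [1.0, 0.0], [1.0, 0.0])
# Orthogonal representations: all of the weight goes to the cosine loss.
ortho_total, ortho_alpha = combined_loss(2.0, [1.0, 0.0], [0.0, 1.0])
```

Under this assumed schedule, poorly aligned representations make the retrieval term dominate, and the generation term takes over once the representations agree.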
Abstract
This paper introduces a novel approach to efficiently feeding knowledge to large language models (LLMs) during prediction by integrating the retrieval and generation processes within a unified framework. While Retrieval-Augmented Generation (RAG) addresses gaps in LLMs' training data and knowledge limits, it is hindered by token-limit restrictions and by its dependence on the accuracy of the retrieval system. Our proposed architecture incorporates in-context vectors (ICVs) to overcome these challenges. The ICV approach recasts in-context learning by using the latent embeddings of an LLM to create a vector that captures essential task information. This vector is then used to shift the latent states of the LLM, enhancing the generation process without adding demonstration examples to the prompt. Because the information is integrated directly into the model, the model can process it more effectively. Our extensive experimental evaluation demonstrates that the ICV approach outperforms standard in-context learning and fine-tuning across question-answering, information retrieval, and other tasks. This approach mitigates the limitations of current RAG models and offers a more robust solution for handling extensive and diverse datasets. Despite using a fraction of the parameters, our ICV-enhanced model achieves competitive performance against models such as LLaMA-3, Gemma, and Phi-3, significantly reducing computational costs and memory requirements. ICVs reduce prompt length, are easy to control, overcome token limitations, and are computationally efficient compared to fine-tuning.
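The idea of building a task vector from latent embeddings and using it to shift latent states can be sketched as follows. This is a simplified illustration, not the paper's procedure: the mean-difference construction, the purely additive shift, and the names `build_icv`, `shift_latent`, and `strength` are all assumptions, and plain Python lists stand in for hidden states.

```python
def build_icv(input_states, target_states):
    """Average the per-example differences between target and input latent states.

    Assumption: each demonstration yields one latent vector for its input and
    one for its target, and the task vector is the mean of their differences.
    """
    n = len(input_states)
    dim = len(input_states[0])
    return [
        sum(t[d] - x[d] for x, t in zip(input_states, target_states)) / n
        for d in range(dim)
    ]


def shift_latent(hidden, icv, strength=0.1):
    """Shift one latent state along the task vector (assumed additive update)."""
    return [h + strength * v for h, v in zip(hidden, icv)]


# Two toy demonstrations with 2-dimensional latent states.
demo_inputs = [[0.0, 0.0], [1.0, 1.0]]
demo_targets = [[1.0, 0.0], [2.0, 3.0]]
icv = build_icv(demo_inputs, demo_targets)

# The query's latent state is steered without lengthening the prompt.
shifted = shift_latent([0.5, 0.5], icv, strength=0.5)
```

The demonstrations are consumed once to build the vector, so the prompt itself carries no examples; the hypothetical `strength` parameter controls how far the latent state is moved.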
