Adapting Large Language Models for Multi-Domain Retrieval-Augmented-Generation

Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina

TL;DR

This work targets multi-domain retrieval-augmented generation (RAG) by (i) constructing a diverse benchmark across 8 sources and 13 domains to test cross-domain performance, and (ii) systematically evaluating RAG adaptation strategies under domain shift. It finds that standard LLM fine-tuning for RAG often fails to generalize across domains, while sequence-level distillation using teacher-generated labels substantially improves out-of-domain performance by providing more coherent supervision. The authors also show that targeted changes to attention patterns via LoRA-QKAtt can enhance robustness, and that RAGChecker-based analysis reveals improved faithfulness and reduced hallucination with distilled labels. Overall, the paper highlights practical strategies for making multi-domain RAG more robust to domain shift.

Abstract

Retrieval-Augmented Generation (RAG) enhances LLM factuality, but multi-domain applications face challenges like lack of diverse benchmarks and poor out-of-domain generalization. The first contribution of this work is to introduce a diverse benchmark comprising a variety of question-answering tasks from 8 sources and covering 13 domains. Our second contribution consists in systematically testing out-of-domain generalization for typical RAG tuning strategies. While our findings reveal that standard fine-tuning fails to generalize effectively, we show that sequence-level distillation with teacher-generated labels improves out-of-domain performance by providing more coherent supervision. Our findings highlight key strategies for improving multi-domain RAG robustness.
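To make the idea of sequence-level distillation concrete, the following is a minimal sketch of the data-preparation step only: each training example keeps its question and retrieved documents, but its ground-truth label is replaced by a response generated by a teacher model. The prompt template, the `RagExample` container, and `toy_teacher` are hypothetical illustrations; the abstract does not specify the prompt format, the teacher model, or the training setup.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagExample:
    question: str
    documents: List[str]  # retrieved passages placed in the context
    label: str            # target text the student is fine-tuned on


def build_prompt(question: str, documents: List[str]) -> str:
    """Hypothetical prompt template: numbered documents, then the question."""
    context = "\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def distill_labels(
    examples: List[RagExample],
    teacher: Callable[[str], str],
) -> List[RagExample]:
    """Return new examples whose labels are teacher-generated responses.

    The original examples are left untouched; only the supervision target
    changes, the inputs (question + documents) stay the same.
    """
    distilled = []
    for ex in examples:
        prompt = build_prompt(ex.question, ex.documents)
        distilled.append(RagExample(ex.question, ex.documents, teacher(prompt)))
    return distilled


def toy_teacher(prompt: str) -> str:
    """Stand-in for a real teacher LLM call (assumption, for illustration only)."""
    question_line = prompt.splitlines()[-2]  # the "Question: ..." line
    return "teacher answer for: " + question_line


if __name__ == "__main__":
    data = [RagExample("What is X?", ["X is a thing.", "Y is another thing."], "a thing")]
    for ex in distill_labels(data, toy_teacher):
        print(ex.label)
```

In practice `toy_teacher` would be replaced by a call to a stronger generator, and the student would then be fine-tuned on the distilled examples in place of the original labels.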

Paper Structure

This paper contains 39 sections, 8 figures, 21 tables.

Figures (8)

  • Figure 1: LLMEval scores for three LLM generators evaluated across four datasets and multiple retriever/reranker configurations. Oracle documents are shown where available. Full results are given in Appendix Table \ref{tab:fig1_full_table}.
  • Figure 2: BioASQ-12b across Top-k Retrieved Documents (with splade-v3 retriever, DeBERTa-v3 reranker). $D$ denotes distractor documents chosen at random from PubMed abstracts to add noise to the context. The smaller model is more sensitive to noise.
  • Figure 3: RAG adaptation results, LLMEval. Color shows difference with the domain's Vanilla RAG score. Averages are computed across all individual datasets.
  • Figure 4: RAGChecker metrics on a subset of domains for Llama-3.2-1B. Effects on other domains are identical to one of the reported domains. All metrics are "the higher the better". The BioASQ dataset has short ground truth labels while other datasets have long labels. To compute metrics we use a Qwen claim extractor to segment out claims made by the generator, and a RoBERTa claim entailment checker.
  • Figure 5: BioASQ-11b Performance Across Retrievers (SOLAR-10.7B generator)
  • ...and 3 more figures