SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

TL;DR

This paper tackles domain adaptation for retrieval-augmented generation by proposing SimRAG, a self-training framework that gives a single LLM joint question answering (QA) and question generation (QG) capabilities. It uses a two-stage pipeline: Stage-I fine-tunes the model on general instruction-following and retrieval-aware data, while Stage-II generates and filters pseudo-labeled domain-specific QA pairs from unlabeled corpora to further adapt the model. Empirical results across 11 datasets in the medical, scientific, and computer science domains show consistent improvements over baselines (1.2% to 8.6%) and performance competitive with strong proprietary models, highlighting the effectiveness of joint QA/QG and self-training for domain-specific RAG. The work offers a cost-efficient, privacy-conscious approach to adapting LLMs to specialized knowledge areas using unlabeled data, with broad implications for applications in science and medicine.

Abstract

Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2%–8.6%.

Paper Structure

This paper contains 24 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Two-stage fine-tuning framework for our proposed method SimRAG. The model is first fine-tuned on retrieval-related data. Then, it generates pseudo-labeled tuples by first extracting candidate answers from the corpus, and then generating candidate questions conditioned on both document and answer. The LLM is further fine-tuned on pseudo-labeled examples filtered with round-trip consistency.
  • Figure 2: Effect of diverse types of generated QA pairs.
  • Figure 3: Effect of different generation models.
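
The Figure 1 caption mentions that pseudo-labeled examples are filtered with round-trip consistency, but this page does not spell out the criterion. The following is a minimal Python sketch under one assumed reading: a synthetic (document, question, answer) tuple is kept only if a retriever, given the generated question, returns the source document among its top-k results. The `SyntheticExample` class, the token-overlap retriever, and the toy corpus are all hypothetical stand-ins for illustration; a real pipeline would use the paper's own retriever and LLM-generated questions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticExample:
    doc_id: int    # index of the source document in the corpus
    question: str  # question generated from the document (hypothetical output)
    answer: str    # candidate answer extracted from the document


def tokenize(text: str) -> list[str]:
    # Lowercase whitespace tokens with surrounding punctuation stripped.
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def overlap_score(query: str, document: str) -> int:
    # Toy lexical retriever score (an assumption): number of shared tokens.
    q, d = Counter(tokenize(query)), Counter(tokenize(document))
    return sum((q & d).values())


def top_k_doc_ids(query: str, corpus: list[str], k: int) -> list[int]:
    # Rank documents by descending score; ties are broken by document index.
    ranked = sorted(range(len(corpus)),
                    key=lambda i: (-overlap_score(query, corpus[i]), i))
    return ranked[:k]


def round_trip_filter(examples, corpus, k=1):
    # Keep an example only if its question retrieves its own source document.
    return [ex for ex in examples
            if ex.doc_id in top_k_doc_ids(ex.question, corpus, k)]


corpus = [
    "Aspirin inhibits cyclooxygenase and reduces fever.",
    "Quicksort has average time complexity O(n log n).",
    "Mitochondria produce ATP through oxidative phosphorylation.",
]
examples = [
    SyntheticExample(0, "Which enzyme does aspirin inhibit?", "cyclooxygenase"),
    # A deliberately inconsistent pair: the question points at document 2.
    SyntheticExample(1, "What do mitochondria produce?", "O(n log n)"),
]
kept = round_trip_filter(examples, corpus, k=1)
```

Here the first example survives because its question retrieves document 0, while the second is discarded because its question retrieves document 2 rather than its claimed source. Raising `k` loosens the filter; the threshold actually used by the authors is not stated on this page.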