RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets

Muhammed Yusuf Kartal, Suha Kagan Kose, Korhan Sevinç, Burak Aktas

TL;DR

RAGSmith addresses the limitation of optimizing RAG pipelines module-by-module by treating RAG design as a holistic architecture search. It introduces a nine-family modular RAG design space and uses a genetic algorithm to search 46,080 configurations, evaluating end-to-end performance with a unified objective. Across six domain datasets derived from Wikipedia, RAGSmith consistently outperforms a naive RAG baseline, with an average improvement of +3.8% and gains of up to +12.5% in retrieval and up to +7.5% in generation. The study identifies a robust backbone consisting of vector retrieval and reflection/revision, while the best augmentation and query-conditioning choices vary with dataset characteristics and question-type distributions. These findings provide practical design principles and demonstrate the viability of evolutionary search for full-pipeline RAG optimization in real-world, domain-specific settings.
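The search idea summarized above can be illustrated with a minimal sketch of a genetic algorithm over a categorical pipeline space. Everything in it is an assumption made for illustration: the family names, the option lists (a 144-configuration toy space, not the paper's 46,080), the `toy_fitness` function standing in for a real end-to-end evaluation, and the population, elitism, and mutation settings. It is not the authors' implementation.

```python
import random

# Hypothetical design space. Family names loosely follow the summary above;
# the options and their counts are illustrative, not the paper's actual space.
DESIGN_SPACE = {
    "query_expansion": ["none", "option_a", "option_b"],
    "retrieval": ["vector", "keyword", "hybrid"],
    "reranking": ["none", "reranker"],
    "augmentation": ["none", "augmenter"],
    "prompt_reordering": ["none", "reorder"],
    "post_generation": ["none", "reflection_revision"],
}


def toy_fitness(config):
    """Stand-in for running a pipeline and scoring it end to end (assumed)."""
    score = 0.0
    if config["retrieval"] == "vector":
        score += 0.3
    if config["post_generation"] == "reflection_revision":
        score += 0.3
    # An interaction term: reranking only pays off together with expansion.
    if config["reranking"] == "reranker" and config["query_expansion"] != "none":
        score += 0.2
    return score


def random_config(rng):
    return {family: rng.choice(options) for family, options in DESIGN_SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each family is inherited from one parent at random.
    return {family: (a[family] if rng.random() < 0.5 else b[family]) for family in DESIGN_SPACE}


def mutate(config, rng, rate=0.1):
    # Each family is resampled from its option list with probability `rate`.
    return {
        family: (rng.choice(DESIGN_SPACE[family]) if rng.random() < rate else value)
        for family, value in config.items()
    }


def genetic_search(fitness, generations=10, pop_size=10, elite=2, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    cache = {}  # each distinct configuration is evaluated at most once

    def score(config):
        key = tuple(config[family] for family in DESIGN_SPACE)
        if key not in cache:
            cache[key] = fitness(config)
        return cache[key]

    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: max(elite, pop_size // 2)]
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = ranked[:elite] + children  # elitism keeps the current best

    best = max(population, key=score)
    return best, score(best), len(cache)


if __name__ == "__main__":
    best, best_score, evaluated = genetic_search(toy_fitness)
    print(best, best_score, f"{evaluated} distinct configurations evaluated")
```

Caching by configuration means the budget is counted in distinct pipelines evaluated rather than in fitness calls, which matters when each evaluation is a full pipeline run.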

Abstract

Retrieval-Augmented Generation (RAG) quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, so optimizing modules in isolation is brittle. We introduce RAGSmith, a modular framework that treats RAG design as an end-to-end architecture search over nine technique families and 46,080 feasible pipeline configurations. A genetic search optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, Computer Science), each with 100 questions spanning factual, interpretation, and long-answer types. RAGSmith finds configurations that consistently outperform a naive RAG baseline by +3.8% on average (range +1.2% to +6.9% across domains), with gains of up to +12.5% in retrieval and +7.5% in generation. The search typically explores about 0.2% of the space (roughly 100 candidates) and discovers a robust backbone (vector retrieval plus post-generation reflection/revision) augmented by domain-dependent choices in expansion, reranking, augmentation, and prompt reordering; passage compression is never selected. Improvement magnitude correlates with question type, with larger gains on factual/long-answer mixes than on interpretation-heavy sets. These results provide practical, domain-aware guidance for assembling effective RAG systems and demonstrate the utility of evolutionary search for full-pipeline optimization.
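To make the scalar objective concrete, here is a hedged sketch of the per-query retrieval metrics named in the abstract, combined with two generation scores into one number. The abstract does not give the aggregation formula, so the equal weighting inside each group and the `w_retrieval=0.5` split are assumptions. The function names are hypothetical, relevance is treated as binary, and `llm_judge` and `semantic_sim` are taken as precomputed scores in [0, 1]. mAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def overall_score(ranked, relevant, llm_judge, semantic_sim, k=5, w_retrieval=0.5):
    """Assumed aggregation: equal-weight means per group, then a weighted sum."""
    retrieval = (
        recall_at_k(ranked, relevant, k)
        + average_precision(ranked, relevant)
        + ndcg_at_k(ranked, relevant, k)
        + reciprocal_rank(ranked, relevant)
    ) / 4
    generation = (llm_judge + semantic_sim) / 2
    return w_retrieval * retrieval + (1 - w_retrieval) * generation


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2"}
    print(overall_score(ranked, relevant, llm_judge=0.8, semantic_sim=0.7, k=3))
```

Averaging this per-query score over a dataset's questions gives a single fitness value of the kind a genetic search can maximize.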

Paper Structure

This paper contains 96 sections, 1 equation, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: RAG Technique Categories
  • Figure 2: All RAG Techniques used in RAGSmith
  • Figure 3: Retrieval and generation score comparisons across datasets.
  • Figure 4: Overall performance and improvement percentages obtained by RAGSmith.