SMARTFinRAG: Interactive Modularized Financial RAG Benchmark

Yiwei Zha

TL;DR

SMARTFinRAG tackles the challenge of evaluating financial Retrieval-Augmented Generation (RAG) systems by delivering a modular, end-to-end benchmarking platform that supports real-time document ingestion, dynamic component swapping, and an interactive demonstration UI. It combines a document-based QA generation workflow with a dual-faceted evaluation engine that reports retrieval quality via hit rate (HR), mean reciprocal rank (MRR), precision (P), recall (R), average precision (AP), and normalized discounted cumulative gain (NDCG), and generation quality via LLM-as-Judge faithfulness and relevancy. Through experiments across retriever families, LLM backends, and decoding settings, the study shows that hybrid retrievers generally improve grounding, decoding parameters exert model-specific effects, and GPT-4o offers strong but not universal performance. The platform aims to accelerate trustworthy financial NLP research and bridge the gap to production RAG systems by providing reproducible, interactive evaluation and extensible components.
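As a rough illustration of the retrieval metrics named above, the following sketch computes each one for a single query using the common textbook definitions with binary relevance. The function names, the chunk IDs, and the choice to normalize AP by min(|relevant|, k) are assumptions made for this example; the paper's exact formulas may differ. MRR is the mean of the per-query reciprocal rank over all queries.

```python
import math
from typing import Sequence, Set


def hit_rate(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k (0.0 if none)."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant chunks."""
    if k <= 0:
        return 0.0
    return sum(d in relevant for d in ranked[:k]) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(d in relevant for d in ranked[:k]) / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Mean of precision values at each relevant hit (assumed normalizer:
    min(|relevant|, k))."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical example: four retrieved chunk IDs, two of them relevant.
ranked_ids = ["c1", "c2", "c3", "c4"]
relevant_ids = {"c2", "c4"}
scores = {
    "HR@3": hit_rate(ranked_ids, relevant_ids, 3),
    "RR@3": reciprocal_rank(ranked_ids, relevant_ids, 3),
    "P@3": precision_at_k(ranked_ids, relevant_ids, 3),
    "R@3": recall_at_k(ranked_ids, relevant_ids, 3),
    "AP@3": average_precision(ranked_ids, relevant_ids, 3),
    "NDCG@3": ndcg_at_k(ranked_ids, relevant_ids, 3),
}
```

Averaging these per-query values over a generated QA set would give one score per metric for each retriever and top-k setting.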

Abstract

Financial sectors are rapidly adopting language model technologies, yet evaluating specialized RAG systems in this domain remains challenging. This paper introduces SMARTFinRAG, addressing three critical gaps in financial RAG assessment: (1) a fully modular architecture where components can be dynamically interchanged during runtime; (2) a document-centric evaluation paradigm generating domain-specific QA pairs from newly ingested financial documents; and (3) an intuitive interface bridging research-implementation divides. Our evaluation quantifies both retrieval efficacy and response quality, revealing significant performance variations across configurations. The platform's open-source architecture supports transparent, reproducible research while addressing practical deployment challenges faced by financial institutions implementing RAG systems.
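One way to picture components that are "dynamically interchanged during runtime" is a pipeline that holds its retriever behind a small interface and lets callers replace it without rebuilding anything else. The sketch below is purely illustrative: the `Retriever` protocol, both toy retriever classes, and `Pipeline.swap_retriever` are hypothetical names invented for this example, not SMARTFinRAG's actual API.

```python
from typing import List, Protocol, Sequence


class Retriever(Protocol):
    """Hypothetical interface: any object with this method can be plugged in."""

    def retrieve(self, query: str, k: int) -> List[str]: ...


class FirstKRetriever:
    """Toy baseline: returns the first k chunks regardless of the query."""

    def __init__(self, chunks: Sequence[str]) -> None:
        self.chunks = list(chunks)

    def retrieve(self, query: str, k: int) -> List[str]:
        return self.chunks[:k]


class KeywordRetriever:
    """Toy lexical retriever: ranks chunks by the number of shared query terms."""

    def __init__(self, chunks: Sequence[str]) -> None:
        self.chunks = list(chunks)

    def retrieve(self, query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        # sorted() is stable, so ties keep their original chunk order.
        ranked = sorted(
            self.chunks,
            key=lambda c: len(terms & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class Pipeline:
    """Holds the current retriever and allows replacing it at runtime."""

    def __init__(self, retriever: Retriever) -> None:
        self.retriever = retriever

    def swap_retriever(self, retriever: Retriever) -> None:
        self.retriever = retriever

    def context_for(self, query: str, k: int = 2) -> List[str]:
        return self.retriever.retrieve(query, k)


# Hypothetical chunks standing in for indexed financial text.
chunks = [
    "net income rose in 2023",
    "the board met in march",
    "revenue and net income guidance",
]
pipeline = Pipeline(FirstKRetriever(chunks))
before = pipeline.context_for("net income")

pipeline.swap_retriever(KeywordRetriever(chunks))  # swap without rebuilding
after = pipeline.context_for("net income")
```

Because the pipeline depends only on the interface, the same evaluation loop can be rerun after each swap to compare configurations on identical questions.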

Paper Structure

This paper contains 39 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: SMARTFinRAG pipeline. Modular components process raw financial documents into indexed chunks and support retrieval-augmented response generation with multi-faceted evaluation. Each stage is configurable and replaceable.
  • Figure 2: SMARTFinRAG pipeline. Modular components process raw financial documents into indexed chunks and support retrieval-augmented response generation with multi-faceted evaluation. Each stage is configurable and replaceable.
  • Figure 3: Retriever-level ranking metrics across top-$k$ settings for different retriever types.
  • Figure 4: Relationship between temperature settings and response quality metrics (faithfulness and relevancy) across different LLM models. GPT-4o shows optimal performance at moderate temperature (0.3), while other models exhibit non-monotonic quality patterns.
  • Figure 5: Impact of top-p sampling values on response quality across different LLM families. Models exhibit highly divergent patterns, with DeepSeek performing best at low top-p (0.1), GPT-3.5-turbo at high top-p (0.9), and other models showing mixed behavior.
  • ...and 4 more figures