SPAR: Session-based Pipeline for Adaptive Retrieval on Legacy File Systems

Duy A. Nguyen, Hai H. Do, Minh Doan, Minh N. Do

TL;DR

The paper addresses the problem of extracting value from large legacy enterprise file systems that lack semantic indexing. It introduces SPAR, a session-based retrieval framework that decouples retrieval from a global vector store by combining a lightweight Metadata Index with on-demand, session-specific vector databases organized into workspaces. A theoretical analysis shows cost and latency benefits over traditional RAG, and experiments on a synthesized biomedical corpus show improved retrieval accuracy and modest gains in downstream task performance. The paper also discusses trade-offs, including dependence on metadata quality and per-session overhead, and outlines future work toward deploying SPAR across diverse enterprise settings.

Abstract

The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Session-based Pipeline for Adaptive Retrieval), a conceptual framework that integrates Large Language Models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture specifically designed for legacy enterprise environments. Unlike conventional RAG pipelines, which require costly construction and maintenance of full-scale vector databases that mirror the entire file system, SPAR employs a lightweight two-stage process: a semantic Metadata Index is first created, after which session-specific vector databases are dynamically generated on demand. This design reduces computational overhead while improving transparency, controllability, and relevance in retrieval. We provide a theoretical complexity analysis comparing SPAR with standard LLM-based RAG pipelines, demonstrating its computational advantages. To validate the framework, we apply SPAR to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, showing improvements in both retrieval effectiveness and downstream model accuracy. Finally, we discuss design trade-offs and outline open challenges for deploying SPAR across diverse enterprise settings.
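The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `MetadataIndex` and `Workspace`, the `FileRecord` type, the file paths, and the bag-of-words `embed` function (a stand-in for a real embedding model) are all assumptions made for illustration. The point is only that files are embedded when a session selects them through the metadata lookup, rather than up front for the whole file system.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    path: str
    tags: frozenset
    text: str


def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class MetadataIndex:
    """Stage 1: a lightweight tag -> files index; no file content is embedded."""

    def __init__(self, records):
        self._by_tag = {}
        for rec in records:
            for tag in rec.tags:
                self._by_tag.setdefault(tag, []).append(rec)

    def lookup(self, tags):
        found = {rec.path: rec for tag in tags for rec in self._by_tag.get(tag, [])}
        return [found[path] for path in sorted(found)]


class Workspace:
    """Stage 2: a session-specific vector store built only from selected files."""

    def __init__(self, records, cache):
        self.vectors = {}
        for rec in records:
            if rec.path not in cache:  # reuse embeddings from earlier sessions
                cache[rec.path] = embed(rec.text)
            self.vectors[rec.path] = cache[rec.path]

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(
            self.vectors, key=lambda p: cosine(q, self.vectors[p]), reverse=True
        )
        return ranked[:k]


# Hypothetical corpus; paths, tags, and text are invented for the example.
records = [
    FileRecord("/archive/a.txt", frozenset({"oncology"}), "tumor growth in lung tissue"),
    FileRecord("/archive/b.txt", frozenset({"oncology"}), "chemotherapy dosage schedule"),
    FileRecord("/archive/c.txt", frozenset({"cardiology"}), "heart rate variability study"),
]
index = MetadataIndex(records)
shared_cache = {}
session = Workspace(index.lookup({"oncology"}), shared_cache)
top_hit = session.search("tumor growth", k=1)
```

In this sketch only the two oncology files are embedded; the cardiology file is never touched unless a later session selects it, and `shared_cache` lets a second workspace reuse embeddings that an earlier one already computed.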

Paper Structure

This paper contains 25 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of an ordinary LLM-based RAG pipeline versus the proposed SPAR pipeline.
  • Figure 2: Detailed components of the SPAR pipeline.
  • Figure 3: Conceptual steps to build and maintain a Metadata Index.
  • Figure 4: Example of a Hierarchical File Tag in a medical application. Leaf tags (orange) are defined by system owners or extracted with LLMs, while intermediate tags are generated to group related concepts. Multi-level queries allow broader or narrower retrieval based on user intent.
  • Figure 5: Example of Session-based Retrieval in a medical application. Each workspace corresponds to an active task with its own temporary vector database, while processed files can be cached and reused across workspaces to reduce the overhead of workspace creation.
  • ...and 2 more figures
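
The multi-level queries mentioned in the Figure 4 caption can be sketched as a lookup over a tag tree. The hierarchy, tag names, and file paths below are hypothetical, and the paper may organize its Hierarchical File Tag differently; the sketch only shows how querying an intermediate tag broadens retrieval to every leaf tag beneath it, while querying a leaf tag narrows it.

```python
# Hypothetical hierarchy: intermediate tag -> child tags. Tags absent as keys are leaves.
TAG_TREE = {
    "medicine": ["oncology", "cardiology"],
    "oncology": ["lung_cancer", "breast_cancer"],
    "cardiology": ["arrhythmia"],
}

# Hypothetical files, each labeled with leaf tags only.
FILE_TAGS = {
    "/archive/scan_001.pdf": {"lung_cancer"},
    "/archive/trial_17.pdf": {"breast_cancer"},
    "/archive/ecg_notes.txt": {"arrhythmia"},
}


def leaf_tags(tag, tree=TAG_TREE):
    """Expand a tag into the set of leaf tags underneath it."""
    children = tree.get(tag)
    if not children:
        return {tag}
    leaves = set()
    for child in children:
        leaves |= leaf_tags(child, tree)
    return leaves


def files_for(tag):
    """Return files carrying any leaf tag covered by the queried tag."""
    wanted = leaf_tags(tag)
    return sorted(path for path, tags in FILE_TAGS.items() if tags & wanted)


broad = files_for("oncology")      # intermediate tag: both cancer files
narrow = files_for("lung_cancer")  # leaf tag: a single file
```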