A Proposed Large Language Model-Based Smart Search for Archive System

Ha Dung Nguyen; Thi-Hoang Anh Nguyen; Thanh Binh Nguyen

TL;DR

This work addresses semantic, multimodal archival search beyond traditional keyword methods by employing a Retrieval-Augmented Generation (RAG) framework powered by Large Language Models (LLMs). It introduces a modular architecture with a Translator, Router Query Engine, Hybrid Retriever, Post-Processors, and Response Synthesizer to process natural-language queries and generate structured, file-referencing outputs. The system builds a knowledge base from diverse data, converts non-text data into text via AI-generated descriptions, and stores semantic embeddings in Pinecone using a model such as BGE-M3. A hybrid scoring mechanism, given by an explicit Hybrid Score equation, balances lexical and semantic signals via a weighting parameter $\alpha$. Experimental results show that a Mistral 7B model often yields the best precision and F1, that tuning $\alpha$ improves retrieval quality, and that the translator and router components are critical for accuracy, with ablations highlighting trade-offs between performance and latency. The study demonstrates the potential of AI-powered archival search to deliver precise, multilingual, and user-friendly access to multimodal digital archives, informing future work on multilingual robustness and scalable deployment.

Abstract

This study presents a novel framework for smart search in digital archival systems, leveraging the capabilities of Large Language Models (LLMs) to enhance information retrieval. By employing a Retrieval-Augmented Generation (RAG) approach, the framework processes natural-language queries and transforms non-textual data into meaningful textual representations. The system integrates advanced metadata generation techniques, a hybrid retrieval mechanism, a router query engine, and robust response synthesis, which together improve search precision and relevance. We present the architecture and implementation of the system and evaluate its performance in four experiments concerning LLM efficiency, hybrid retrieval optimizations, multilingual query handling, and the impact of individual components. The results show significant improvements over conventional approaches and demonstrate the potential of AI-powered systems to transform modern archival practices.
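
To make the role of the weighting parameter $\alpha$ concrete, the following is a minimal sketch of one common way to blend a semantic (dense) score with a lexical (sparse) score. The paper's own Hybrid Score equation is not reproduced on this page, so the convex-combination form, the choice that $\alpha$ weights the semantic side, and the names `hybrid_score` and `rank` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    """Blend two relevance scores with a convex combination.

    Assumption: alpha weights the semantic (dense) score and
    (1 - alpha) weights the lexical (sparse) score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


def rank(candidates: List[Tuple[str, float, float]], alpha: float) -> List[Tuple[str, float]]:
    """Rank (file_id, dense_score, sparse_score) triples by hybrid score, best first.

    Scores are assumed to be normalised to a comparable range beforehand.
    """
    scored = [(fid, hybrid_score(d, s, alpha)) for fid, d, s in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under this assumed form, $\alpha = 1$ reduces to purely semantic ranking and $\alpha = 0$ to purely lexical ranking, which is why tuning $\alpha$ can shift which files surface first.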

Paper Structure (27 sections, 1 equation, 8 figures, 3 tables)

This paper contains 27 sections, 1 equation, 8 figures, 3 tables.

Introduction
Backgrounds and Preliminaries
Archive System
Search Features in Archive System
Traditional Search
LLM-based Search
Metadata in Archive System
A Proposed Framework for Smart Search in Archive System
Knowledge Base Creation
Data Collection and Preparation
Embedding Creation and Indexing
The Architecture of the Proposed Smart Search System
Translator (Query and Response)
Router Query Engine
Hybrid Retriever
...and 12 more sections

Figures (8)

Figure 1: Illustration of User Input and Corresponding Desired Output in Our Proposed System
Figure 2: The Architecture of the Proposed Smart Search System.
Figure 3: Hybrid Retriever.
Figure 4: Response Synthesizer.
Figure 5: Illustration of retrieved files from the proposed system: In this example, although the retriever extracted four files, only two files are mentioned in the LLM's response. Consequently, the retrieved files are identified as 5138120512 and 1466458735.
...and 3 more figures
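
The rule described in the Figure 5 caption, where only files that the LLM's response actually mentions count as retrieved, can be sketched as follows. This is a hedged illustration: the function name `mentioned_files`, the assumption that file identifiers are plain digit strings, and the sample response text are hypothetical and not taken from the paper.

```python
import re
from typing import List


def mentioned_files(candidate_ids: List[str], response_text: str) -> List[str]:
    """Return the candidate IDs that appear in the response, in candidate order.

    Assumption: file IDs are digit-only strings, so each maximal run of
    digits in the response is treated as one possible ID.
    """
    tokens = set(re.findall(r"\d+", response_text))
    return [fid for fid in candidate_ids if fid in tokens]


# Four files come back from the retriever; the last two IDs are made up.
candidates = ["5138120512", "1466458735", "1111111111", "2222222222"]
# Hypothetical LLM response that cites only two of them.
response = "The relevant records are files 5138120512 and 1466458735."
retrieved = mentioned_files(candidates, response)
```

Matching on whole digit runs rather than substrings avoids counting an ID that merely occurs inside a longer number.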
