A Fine-tuning Enhanced RAG System with Quantized Influence Measure as AI Judge

Keshav Rangan, Yiqiao Yin

TL;DR

This work advances retrieval-augmented generation by fusing fine-tuned LLMs with vector databases to improve chatbot performance in shelter contexts. It introduces LoRA/QLoRA-based parameter-efficient fine-tuning, a production-oriented architecture, and a Quantized Influence Measure (QIM) acting as an AI Judge to refine result ranking. The approach is validated on a proprietary, low-volume dataset derived from YSA materials, demonstrating superior performance for RAG configurations, especially when augmented with QIM. The framework emphasizes user feedback integration and public availability of data and tools, signaling practical impact for social-good applications and broader retrieval-augmented system development.

Abstract

This study presents an innovative enhancement to retrieval-augmented generation (RAG) systems by seamlessly integrating fine-tuned large language models (LLMs) with vector databases. This integration capitalizes on the combined strengths of structured data retrieval and the nuanced comprehension provided by advanced LLMs. Central to our approach are the LoRA and QLoRA methodologies, which stand at the forefront of model refinement through parameter-efficient fine-tuning and memory optimization. A novel feature of our research is the incorporation of user feedback directly into the training process, ensuring the model's continuous adaptation to user expectations and thus improving its performance and applicability. Additionally, we introduce a Quantized Influence Measure (QIM) as an innovative "AI Judge" mechanism to enhance the precision of result selection, further refining the system's accuracy. Accompanied by an executive diagram and a detailed algorithm for fine-tuning QLoRA, our work provides a comprehensive framework for implementing these advancements within chatbot technologies. This research contributes significant insights into LLM optimization for specific uses and heralds new directions for further development in retrieval-augmented models. Through extensive experimentation and analysis, our findings lay a robust foundation for future advancements in chatbot technology and retrieval systems, marking a significant step forward in the creation of more sophisticated, precise, and user-centric conversational AI systems.

Paper Structure (12 sections, 8 equations, 4 figures, 3 tables, 1 algorithm)

This paper contains 12 sections, 8 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Executive Diagram for Fine-tuning Enhanced RAG. The fine-tuning enhanced RAG algorithm integrates vector database queries with fine-tuned LLM insights to generate contextually rich and accurate responses.
  • Figure 2: Comparison of the behavior of cosine similarity and QIM. Graphical analysis of how vector size affects the relationship between cosine similarity and the quantized influence measure. For vectors of size 10, the signals appear random. In practice, however, embedding layers produce vector representations of size 1000 or above. The simulation shows that for vectors of size 1000 the value of the quantized influence measure increases exponentially. For content with extremely high similarity, it is therefore much easier to use the quantized influence measure to filter and select the relevant content or reference in the RAG algorithm. The quantization level $q$ is a tuning parameter: the experiment varies $q$ from 4 to 32 bits, with larger values delivering better results at the cost of longer computation time. Higher $q$ values produce more densely generated partitions, but computing the Quantized Influence Measure also takes longer.
  • Figure 3: Proposed System Architecture. This executive diagram explains the system architecture used to implement the proposed method in a chatbot. (1) The training data is created in the "text-generation" style: a dictionary whose "Human" and "Assistant" keys hold question-answer pairs. This is the training data for fine-tuning models. (2) We fine-tune large language models using the methods discussed in the previous sections (Algorithm 1). (3) A vector database is created, and this collection stores the data in vector form (we use the "chroma" library). (4) The query searches the vector database against the user's question/prompt and returns selections with a distance score. This allows us to filter against a chosen threshold, e.g. 0.2. (5) We take the question/prompt, the answer, and the reference and display them on the screen for the user. (6) We can ask for user feedback and save the cache to a directory for next-stage training, because we can train another model to learn the user's preferences. (7) We use the proposed quantized influence measure as an additional "AI Judge" to help rank the results from the fourth step. (8) We use the feedback provided by the user to enhance the training data in the first step. (9) Finally, the last step proposes packaging the code into one API and a cleaner single software package, giving technical users programmatic access.
  • Figure 4: Executive Diagram for Data Processing. This diagram illustrates the process flow from extracting text from PDF documents to generating and organizing conversational Q&A data for fine-tuning Large Language Models (LLMs).
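
Steps (1), (4), and (8) of the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the sample documents, and the feedback rule (keep only answers the user approved) are hypothetical. The `results` dictionary imitates the nested-list shape that a chroma collection query returns under the keys "documents" and "distances", so the sketch runs without the library installed. It also assumes that a smaller distance means a closer match, so entries at or below the threshold are kept.

```python
from typing import Dict, List, Optional, Tuple


def make_training_record(question: str, answer: str) -> Dict[str, str]:
    """Step (1): one question-answer pair in the "Human"/"Assistant" form."""
    return {"Human": question, "Assistant": answer}


def filter_by_distance(
    results: Dict[str, List[list]], threshold: float = 0.2
) -> List[Tuple[str, float]]:
    """Step (4): keep (document, distance) pairs within the threshold.

    Assumes `results` holds one inner list per query and that a smaller
    distance means a closer match. Returned pairs are sorted closest first.
    """
    docs = results["documents"][0]
    dists = results["distances"][0]
    kept = [(doc, dist) for doc, dist in zip(docs, dists) if dist <= threshold]
    return sorted(kept, key=lambda pair: pair[1])


def feedback_to_record(
    question: str, answer: str, liked: bool
) -> Optional[Dict[str, str]]:
    """Step (8), hypothetical rule: reuse only answers the user approved."""
    return make_training_record(question, answer) if liked else None


# Illustrative data only; these documents and distances are made up.
results = {
    "documents": [["Intake opens at 9 am.", "Bus routes downtown.", "Meal times."]],
    "distances": [[0.12, 0.35, 0.18]],
}
for doc, dist in filter_by_distance(results, threshold=0.2):
    print(f"{dist:.2f}  {doc}")
```

The threshold filter runs before any re-ranking, so an additional judge such as the QIM in step (7) would only need to score the candidates that survive it.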