Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Liang Zhang, Katherine Jijo, Spurthi Setty, Eden Chung, Fatima Javid, Natan Vidra, Tommy Clifford

TL;DR

Large language models suffer from hallucinations and inaccuracies in high-stakes finance QA. The paper proposes a fine-tuning pipeline with human-labeled data, retrieval augmentation via FLARE and HyDE, and parameter-efficient tuning (LoRA/QLoRA), evaluated across multiple models and datasets. Results show that fine-tuning with labeled data and retrieval-augmented QA systematically improves accuracy over zero-shot baselines, with additional gains from advanced retrieval techniques. The work provides practical, scalable strategies for domain-specific QA in finance, highlighting trade-offs between open-source and closed-model deployments and outlining concrete directions for further enhancement.

Abstract

Large Language Models (LLMs) generate responses to questions; however, their effectiveness is often hindered by sub-optimal answer quality and occasional failures to respond accurately. To address these challenges, a fine-tuning process is employed that uses feedback and examples to refine the models. The objective is to enhance AI models through continuous feedback loops, using metrics such as cosine similarity, LLM evaluation, and ROUGE-L scores to evaluate them. Using LLMs such as GPT-3.5, GPT4ALL, LLaMA2, and Claude, this approach is benchmarked on financial datasets, including FinanceBench and the RAG Instruct Benchmark Tester Dataset, illustrating the necessity of fine-tuning. The results show that fine-tuned models can surpass the accuracy of zero-shot LLMs, providing superior question-answering capabilities. Notably, combining fine-tuning of the LLM with Retrieval Augmented Generation (RAG) produces responses with improved accuracy.
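
Two of the metrics named in the abstract, cosine similarity and ROUGE-L, can be illustrated with a minimal Python sketch. This is not the paper's evaluation code. It assumes whitespace tokenization, uses bag-of-words counts as a stand-in for whatever embedding model the authors used for cosine similarity, and reports ROUGE-L as a plain F1 over the longest common subsequence (LCS). The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (assumed stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(x: list, y: list) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed F1 variant)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration only; not taken from any dataset.
    reference = "revenue was 10 million dollars"
    candidate = "the revenue was 10 million"
    print(round(cosine_similarity(candidate, reference), 3))
    print(round(rouge_l_f1(candidate, reference), 3))
```

Both scores reward word overlap with the reference answer. ROUGE-L additionally respects word order through the LCS, which is why the two are often reported together alongside an LLM-based judgment.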

Paper Structure

This paper contains 17 sections and 8 figures.

Figures (8)

  • Figure 1: Retrieval-Augmented Generation
  • Figure 2: Fine Tuning Methods and LoRA performance
  • Figure 3: Parameter Efficient Fine Tuning
  • Figure 4: Sample of FinanceBench Dataset
  • Figure 5: Sample of RAG Instruct Benchmark Tester Dataset
  • ...and 3 more figures