A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts

Jhon Rayo, Raul de la Rosa, Mario Garrido

TL;DR

Regulatory texts are lengthy and highly specialized, which makes information retrieval and question answering difficult. The authors propose a hybrid information retrieval pipeline that combines BM25 lexical ranking with a fine-tuned Sentence Transformer for semantic matching, integrated within a Retrieval Augmented Generation (RAG) framework that synthesizes answers from retrieved passages. Empirical results show that the hybrid retriever outperforms purely lexical and purely semantic baselines on Recall@10 and MAP@10, and that GPT-3.5 Turbo within the RAG setup delivers high-quality, policy-aligned answers. The fine-tuned model and methodology are openly released to foster further development in regulatory NLP. Overall, the work advances practical regulatory information access by combining robust retrieval with generative synthesis in an open, domain-focused framework.

Abstract

Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model and methodology, we aim to advance the development of robust natural language processing tools for compliance-driven applications in regulatory domains.
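
The abstract describes combining BM25 with a sentence transformer but does not state here how the two scores are merged. As a minimal sketch, assuming a weighted sum of min-max normalized scores (one common fusion choice, not necessarily the authors'), the idea could look like the following. The function names, the `alpha` weight, and the BM25 parameters `k1` and `b` are illustrative assumptions. The semantic scores are taken as given inputs, for example cosine similarities between query and passage embeddings.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25.

    Uses the Lucene-style IDF (with +1 inside the log) so scores stay
    non-negative. k1 and b are common defaults, assumed for illustration.
    """
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(lexical, semantic, alpha=0.5):
    """Fuse two score lists and return (ranked indices, fused scores).

    alpha weights the lexical side and (1 - alpha) the semantic side.
    This weighted sum is an assumed fusion rule, not the paper's.
    """
    lex, sem = min_max(lexical), min_max(semantic)
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order, fused


# Toy corpus and hypothetical semantic similarities, for illustration only.
docs = [
    "capital requirements for banks".split(),
    "reporting obligations for funds".split(),
    "banks must hold capital".split(),
]
lexical = bm25_scores("capital banks".split(), docs)
semantic = [0.62, 0.35, 0.71]  # assumed cosine similarities
ranking, fused = hybrid_rank(lexical, semantic, alpha=0.5)
```

Normalizing before fusing matters because BM25 scores are unbounded while cosine similarities lie in a fixed range, so an unnormalized sum would let one signal dominate. Setting `alpha` to 1.0 or 0.0 recovers the purely lexical or purely semantic ranking, which mirrors the baselines the abstract compares against.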

Paper Structure

This paper contains 7 sections, 3 tables.