Observations on Building RAG Systems for Technical Documents

Sumit Soman, Sujoy Roychowdhury

TL;DR

This study investigates how Retrieval Augmented Generation (RAG) performs on technical telecom documents, focusing on how chunk size, glossary handling, and retrieval strategy affect QA quality. It evaluates MPNet-based embeddings and a Llama2-7b-chat model on 42 domain questions drawn from IEEE standards, comparing glossary-based and full-document retrieval, and finds that sentence-based retrieval and definition-term splitting improve results. The work shows that embedding similarity signals are brittle across chunk sizes and that threshold-based retrieval can be unreliable, which places practical constraints on long-form technical QA. It also identifies domain-aligned evaluation metrics and follow-up-question capabilities as important directions for future RAG systems.
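To make the idea of sentence-based retrieval with a similarity threshold concrete, here is a minimal sketch. It is not the paper's pipeline: the `embed` function is a bag-of-words stand-in for MPNet embeddings, and the names `retrieve_sentences`, `threshold`, and `top_k`, along with the value 0.2, are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    """Stand-in embedding (assumption): lowercase bag-of-words counts, not MPNet."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_sentences(query, document, threshold=0.2, top_k=3):
    """Score each sentence against the query, drop those below the
    (hypothetical) threshold, and return the top_k as (score, sentence)."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in split_sentences(document)]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


# Illustrative text, not taken from any IEEE standard.
document = (
    "A beacon frame is a management frame sent periodically by an access point. "
    "The weather was pleasant during the meeting. "
    "Stations use the beacon frame to synchronize their clocks."
)
query = "What is a beacon frame?"
results = retrieve_sentences(query, document)
```

In this toy example the unrelated sentence shares no tokens with the query, scores zero, and is dropped by the threshold. With real embeddings the scores depend on chunk length as well as content, which is why a single fixed threshold can be fragile.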

Abstract

Retrieval augmented generation (RAG) for technical documents is challenging because embeddings often fail to capture domain information. We review prior art on the important factors affecting RAG and run experiments to highlight best practices and potential challenges in building RAG systems for technical documents.

Paper Structure

This paper contains 7 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Distribution of similarity scores across 10,974 documents of various sizes, split by the number of words in each document