Techniques to Improve Q&A Accuracy with Transformer-based Models on Large Complex Documents
Chejui Liao, Tabish Maniar, Sravanajyothi N, Anantha Sharma
TL;DR
The paper tackles the problem of obtaining accurate Q&A answers from transformer models on long, complex documents by proposing a structured text-processing pipeline. It introduces Definition Tokenization, Dependency Tokenization, Paragraph Splitting, Relevant Paragraph Ranking, Soundex Encoding, and targeted BERT Fine-tuning, and combines them to reduce input size while preserving answer relevance. Empirical results on regulatory documents show substantial improvements in F1 score (roughly 30–50% gain) when the pipeline is used, approaching the performance obtained with a manually curated paragraph as input and outperforming runs that pass the full document to the model. The work discusses trade-offs and limitations, such as tokenization consistency and dependency parse reliability, and points to future improvements in acronym handling and more robust text representations to further improve practical Q&A systems for large documents.
Abstract
This paper evaluates several text processing techniques, their combinations, and encodings for reducing the complexity and size of a given text corpus. The simplified corpus is then passed to BERT (or a similar transformer-based model) for question answering, where it can yield more relevant responses to user queries. The paper takes a scientific approach to measuring the benefit and effectiveness of each technique and identifies a best-fit combination that produces a statistically significant improvement in accuracy.
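To make the idea of reducing the corpus before question answering concrete, the following is a minimal sketch of paragraph splitting followed by relevance ranking. It is an illustration under stated assumptions, not the pipeline evaluated in the paper: the blank-line splitting rule, the IDF-weighted term-overlap score, the function names (`split_paragraphs`, `rank_paragraphs`), and the sample text are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def split_paragraphs(text):
    """Split a document on blank lines and drop empty chunks (assumed rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(question, paragraphs, top_k=2):
    """Return the top_k paragraphs by IDF-weighted term overlap with the question.

    This scoring function is an assumption made for illustration only.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    q_terms = set(tokenize(question))

    def idf(term):
        # Smoothed inverse document frequency over the paragraphs.
        df = sum(1 for d in docs if term in d)
        return math.log((n + 1) / (df + 1)) + 1.0

    scored = []
    for idx, d in enumerate(docs):
        score = sum(d[t] * idf(t) for t in q_terms)
        scored.append((score, idx))
    # Highest score first; ties keep the original paragraph order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [paragraphs[i] for _, i in scored[:top_k]]


# Hypothetical sample text standing in for a regulatory document.
document = """Capital requirements apply to all covered institutions.

A covered institution must report liquidity ratios every quarter.

The committee meets annually to review membership."""

question = "How often must liquidity ratios be reported?"
context = "\n\n".join(rank_paragraphs(question, split_paragraphs(document), top_k=1))
print(context)
```

Only the selected `context`, rather than the full document, would then be passed to the question answering model, which is how the input size is kept within the model's limits in this sketch.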
