Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

David Beauchemin; Richard Khoury

TL;DR

This paper introduces AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks, and uses it to evaluate 51 LLMs in closed-book and retrieval-augmented settings. The results suggest that while current architectures approach expert-level proficiency, the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.

Abstract

The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.

Paper Structure (27 sections, 2 figures, 4 tables)

Introduction
Related Work
Legal and Insurance NLP Resources
Retrieval-Augmented Generation in Specialized Domains
Trustworthiness and Hallucination Benchmarking
Experimental Setup
Evaluation
Evaluation Benchmark
Evaluation Protocol and Metrics
Evaluated Models
Baselines
LLM
Results and Discussion
The Supremacy of Inference-Time Reasoning
Closed-Book versus RAG
...and 12 more sections

Figures (2)

Figure 1: Example of a translated AEPC-QA question, its choices, and the response, along with the prompt used for the evaluation. The blue box contains the task instructions. The yellow box contains the prefix for the model to continue. Texts in "$\ll\gg$" are role tags to be fed to the model.
Figure 2: Accuracy scores with (y-axis) and without (x-axis) the RAG system. Black dashed lines mark our Random baseline scores. The dotted lines represent the statistical significance boundaries ($Z=\pm 3.29$, $\alpha=0.001$). Points falling within the central cone indicate no statistically significant difference between closed-book and RAG performance. Points above the upper dotted line show a significant improvement with RAG, and points below the lower dotted line show a significant degradation. Red dots are models that performed worse than the baseline on one of the corpora, green dots are models that scored above 60% on both corpora, and blue dots are models that fit neither of these two performance classes.
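
The significance boundaries in Figure 2 can be illustrated with a small sketch. The caption gives only the thresholds ($Z=\pm 3.29$, $\alpha=0.001$) and does not name the test, so the pooled two-proportion z-test below is an assumption, as are the function names and the example accuracies; only the 807-question benchmark size and the critical value come from the text. Since both settings answer the same questions, a paired test such as McNemar's could also be argued for.

```python
import math

N_QUESTIONS = 807  # benchmark size stated in the abstract
Z_CRIT = 3.29      # two-sided critical value for alpha = 0.001 (from the caption)


def two_proportion_z(acc_closed: float, acc_rag: float, n: int = N_QUESTIONS) -> float:
    """Pooled two-proportion z statistic (assumed test, not confirmed by the paper).

    Positive values mean RAG accuracy is higher than closed-book accuracy.
    """
    pooled = (acc_closed + acc_rag) / 2  # equal sample size in both settings
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 0.0  # both accuracies are 0 or both are 1: no difference
    return (acc_rag - acc_closed) / se


def classify(acc_closed: float, acc_rag: float, n: int = N_QUESTIONS) -> str:
    """Place a model relative to the dotted significance boundaries."""
    z = two_proportion_z(acc_closed, acc_rag, n)
    if z > Z_CRIT:
        return "significant improvement with RAG"
    if z < -Z_CRIT:
        return "significant regression with RAG"
    return "no significant difference"


# Hypothetical accuracies, chosen only to exercise each region of the plot.
print(classify(0.40, 0.75))
print(classify(0.70, 0.72))
print(classify(0.75, 0.40))
```

Under this assumed test, a two-point gap at around 70% accuracy stays inside the central cone, while a 35-point swing in either direction falls well outside it.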
