Table of Contents
Fetching ...

Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

David Beauchemin, Richard Khoury

TL;DR

AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks is introduced, suggesting that while current architectures approach expert-level proficiency, the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.

Abstract

The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.

Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

TL;DR

AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks is introduced, suggesting that while current architectures approach expert-level proficiency, the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.

Abstract

The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.
Paper Structure (27 sections, 2 figures, 4 tables)

This paper contains 27 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Example of a translated APEC-QA question, choices and response along with the prompt used for the evaluation. Blue box contains the task instructions. Yellow box contains the prefix for the model to continue. Texts in "$\ll\gg$" are role-tags to be fed to the model.
  • Figure 2: Accuracy scores with (y-axis) and without (x-axis) RAG system. Black dashed lines are our Random baseline scores. The dotted lines represent the statistical significance boundaries ($Z=\pm 3.29$, $\alpha=0.001$). Points falling within the central cone indicate no statistically significant difference between closed-book and RAG performance. Points above the upper dotted line show significant improvement with RAG, and points below mean the opposite. Red dots are models that performed poorer than the baseline on one of the corpora, green dots are models that performed better than 60% on both corpora, while blue dots are those that do not fit in the two other performance classes.