InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models

Jing Ding; Kai Feng; Binbin Lin; Jiarui Cai; Qiushi Wang; Yu Xie; Xiaojin Zhang; Zhongyu Wei; Wei Chen

TL;DR

InsQABench presents a comprehensive benchmark for Chinese insurance QA, categorizing tasks into Commonsense QA, Database QA, and Clause QA to reflect real-world knowledge types. It introduces two task-specific methods, SQL-ReAct for structured data and RAG-ReAct for unstructured clause documents, and demonstrates that supervised fine-tuning with LoRA substantially improves domain alignment. The dataset construction combines large-scale data collection with expert input, evolutionary question diversification, and PDF-driven clause extraction, accompanied by thorough experimental evaluations showing gains over baselines and competitive performance against proprietary models. This work offers a solid foundation for applying and advancing LLMs in high-stakes insurance contexts, with open data and code to support broader adoption and further research.

Abstract

The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering tasks. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at https://github.com/HaileyFamo/InsQABench.git.

Paper Structure (42 sections, 2 equations, 14 figures, 12 tables)

Introduction
Related Work
Method
Supervised Fine-Tuning (SFT)
SQL-ReAct
RAG-ReAct
InsQABench Dataset
Insurance Commonsense QA
Training Set for Commonsense QA
Test Set for Commonsense QA
Insurance Database QA
Database Construction
Training Set for Database QA
Test Set for Database QA
Insurance Clause QA
...and 27 more sections

Figures (14)

Figure 1: Overview of the InsQABench benchmark, illustrating the multi-faceted insurance knowledge system and fine-tuned LLMs utilizing SQL-ReAct and RAG-ReAct for task-specific enhancements.
Figure 2: Examples in the InsQABench Dataset.
Figure 3: The construction process of the Database QA Dataset.
Figure 4: The construction process of the Clause QA Dataset.
Figure 5: The types and topics of the questions in the three tasks.
...and 9 more figures
