CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance
Jinru Ding, Chao Ding, Wenrao Pang, Boyi Xiao, Zhiqiang Liu, Pengcheng Chen, Jiayuan Chen, Tiantian Yuan, Junming Guan, Yidong Jiang, Dawei Cheng, Jie Xu
TL;DR
The paper introduces CNFinBench, a holistic benchmark for evaluating LLM safety, compliance, and capability in finance by organizing tasks into Capability, Compliance & Risk Control, and Safety. It constructs a large, rigorously validated dataset (13k+ single-turn items and 400 multi-turn dialogues) drawn from authoritative financial materials, and introduces the Harmful Instruction Compliance Score (HICS) alongside a three-LLM judge ensemble with expert calibration. The findings reveal a persistent gap between capability and compliance, with safety proving vulnerable under multi-turn adversarial dialogue and finance-specific tuning not guaranteeing superior holistic performance. The work emphasizes the need for auditable, rule-based reasoning and adversarially robust evaluation to ensure safe and compliant deployment of LLMs in high-stakes financial contexts.
Abstract
Large language models (LLMs) are increasingly deployed across the financial sector for tasks like investment research and algorithmic trading. The high-stakes nature of these applications demands rigorous evaluation of models' safety and regulatory alignment. However, there is a significant gap between what current evaluations measure and what safe deployment requires. Current financial benchmarks mainly focus on textbook-style question answering and numerical problem-solving, failing to simulate the open-ended scenarios where safety risks typically manifest. To close this gap, we introduce CNFinBench, a benchmark structured around a Capability-Compliance-Safety triad encompassing 15 subtasks. For Capability Q&As, we introduce a novel business-vertical taxonomy aligned with core financial domains like banking operations, which allows institutions to assess model readiness for deployment in operational scenarios. For Compliance and Risk Control Q&As, we embed regulatory requirements within realistic business scenarios to ensure models are evaluated under practical, scenario-driven conditions. For Safety Q&As, we incorporate structured bias and fairness auditing, a dimension overlooked by other holistic financial benchmarks, and introduce the first multi-turn adversarial dialogue task to systematically expose compliance decay under sustained, context-aware attacks. To score this task, we propose the Harmful Instruction Compliance Score (HICS), which quantifies models' consistency in resisting harmful instructions across multi-turn dialogues. Experiments on 21 models across all subtasks reveal a persistent gap between capability and compliance: models achieve an average score of 61.0 on capability tasks but drop to 34.2 on compliance and risk-control evaluations. In multi-turn adversarial dialogue tests, most LLMs attain only partial resistance, demonstrating that refusal alone is insufficient without cited, verifiable reasoning.
