LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements

Jianwei Wang, Mengqi Wang, Yinsi Zhou, Zhenchang Xing, Qing Liu, Xiwei Xu, Wenjie Zhang, Liming Zhu

TL;DR

HSE-Bench introduces the first IRAC-based benchmark for evaluating LLMs on health, safety, and environment (HSE) compliance, drawing on regulations, court cases, safety exams, and field videos. It systematically analyzes prompting strategies and proposes Reasoning of Expert (RoE), a multi-expert simulation intended to improve principled regulatory reasoning. Empirical results show that current models rely heavily on semantic matching and lack structured legal reasoning, while RoE improves accuracy and highlights the value of expert-like reasoning. The work provides a framework and dataset aimed at safer, more reliable HSE decision support in high-risk industrial contexts and points to directions for future improvement.

Abstract

Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models, and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning traces lack the systematic legal reasoning required for rigorous HSE compliance assessment. To address these limitations, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.
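
The abstract describes RoE only at a high level: the model simulates several experts, each reasons about the case, and the results are merged into one decision. A minimal sketch of what such a prompt could look like follows. It is not the authors' prompt; the `Expert` class, the `build_roe_prompt` function, the persona names, and the exact wording are all hypothetical, and only the four IRAC stage names come from the abstract.

```python
from dataclasses import dataclass

# The four IRAC stages named in the abstract.
IRAC_STEPS = ("Issue spotting", "Rule recall", "Rule application", "Conclusion")


@dataclass(frozen=True)
class Expert:
    """A hypothetical expert persona; titles and focus areas are illustrative."""
    title: str
    focus: str


def build_roe_prompt(scenario: str, question: str, experts: list[Expert]) -> str:
    """Compose one prompt asking for per-expert IRAC reasoning, then a unified decision."""
    if not experts:
        raise ValueError("at least one expert persona is required")
    lines = [
        "You are assessing HSE compliance.",
        f"Scenario: {scenario}",
        f"Question: {question}",
        "",
        "Reason separately as each expert below, following these steps in order:",
    ]
    # Number the IRAC stages so the model is asked to follow them in sequence.
    lines += [f"  {i}. {step}" for i, step in enumerate(IRAC_STEPS, start=1)]
    lines.append("")
    lines += [f"- {e.title} (focus: {e.focus})" for e in experts]
    lines.append("")
    lines.append("Finally, reconcile the experts' conclusions into one unified decision.")
    return "\n".join(lines)
```

Building the prompt as a plain string keeps the sketch independent of any particular LLM client; the result can be passed to whichever chat or completion API is in use.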

Paper Structure

This paper contains 29 sections, 2 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Conceptual framework of this project, featuring benchmark construction, comprehensive performance evaluation, and a new advancement for applying LLMs to HSE compliance assessment.
  • Figure 2: Overall results on different sources of data. Without options, LLMs show a sharp performance drop from $\sim90$% accuracy to $\sim70$% AUC-ROC, indicating a reliance on semantic matching between questions and options. Moreover, reasoning models fail to outperform foundation models.
  • Figure 3: Prompt strategies evaluation on both DeepSeek-V3 and DeepSeek-R1. Our RoE prompt significantly outperforms other prompting strategies on both LLMs.
  • Figure 4: IRAC reasoning evaluation. LLMs show a significant performance drop in the full IRAC pipeline, indicating difficulty in applying legal reasoning for HSE compliance.
  • Figure 5: Prompt strategies evaluation (AUC-ROC).
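
Figures 2 and 5 report AUC-ROC for the setting without answer options. This summary does not say how the paper derives model scores, so the sketch below only illustrates the standard pairwise definition of the metric, assuming binary labels (1 = positive, 0 = negative) and one real-valued score per item. The function name `auc_roc` is illustrative.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    # Compare every positive score against every negative score.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
print(auc_roc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.75
```

Unlike accuracy, this value does not depend on a decision threshold: 0.5 corresponds to random ranking and 1.0 to perfect separation, so the accuracy and AUC-ROC figures quoted in the Figure 2 caption are not on directly interchangeable scales.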