ABEX-RAT: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports

Jian Chen, Jiabao Dou

TL;DR

The paper tackles severe class imbalance and data scarcity in occupational accident report classification by introducing ABEX-RAT, a resource-efficient framework that pairs abstractive-expansive (ABEX) data augmentation with random adversarial training (RAT). A prompt-guided abstraction step distills label-critical semantics, a diversity-driven expansion step synthesizes minority-class samples, and a fixed embedding extractor feeds a lightweight classifier trained with RAT. On the OSHA dataset the method reports a state-of-the-art Macro-F1 of 90.32% and Weighted-F1 of 92.82%, outperforming traditional baselines and large-model fine-tuning while remaining efficient. The results suggest that targeted data enrichment combined with robust regularization can reach high accuracy in specialized domains without costly full-parameter LLM fine-tuning.
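Because the summary reports both Macro-F1 and Weighted-F1, a small sketch of how the two averages differ under class imbalance may help. The labels and predictions below are made up for illustration and are not OSHA data; the function names are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Macro-F1 averages classes equally; Weighted-F1 weights each class by its support."""
    labels = sorted(set(y_true))  # classes that occur in the ground truth
    support = Counter(y_true)
    scores = {c: per_class_f1(y_true, y_pred, c) for c in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted

# Toy imbalanced example: 8 majority-class reports, 2 minority-class reports.
y_true = ["fall"] * 8 + ["fire"] * 2
y_pred = ["fall"] * 8 + ["fall", "fire"]  # one minority report is misclassified
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

In this toy case the single minority-class error lowers Macro-F1 more than Weighted-F1, which is why Macro-F1 is the more sensitive metric when minority classes matter.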

Abstract

The automatic classification of occupational accident reports is pivotal for workplace safety analysis but is persistently hindered by severe class imbalance and data scarcity. In this paper, we propose ABEX-RAT, a resource-efficient framework that synergizes generative data augmentation with robust adversarial learning. Unlike computationally expensive fine-tuning of large language models (LLMs), our approach employs a two-stage abstractive-expansive (ABEX) pipeline: it first uses a prompt-guided LLM to distill label-critical semantics into concise abstracts, which are then expanded into diverse synthetic samples to balance the data distribution. Subsequently, we train a lightweight classifier using a random adversarial training (RAT) protocol, which stochastically injects perturbations to enhance generalization without significant computational overhead. Experimental results on the OSHA dataset demonstrate that ABEX-RAT establishes a new state of the art, achieving a Macro-F1 score of 90.32% and significantly outperforming both traditional baselines and fine-tuned large models. This confirms that targeted augmentation combined with robust training offers a superior, data-efficient alternative for specialized domain classification. The source code will be made publicly available upon acceptance.
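The abstract describes RAT only as a protocol that stochastically injects perturbations during training. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a linear softmax classifier in place of the paper's lightweight classifier, an FGSM-style sign-gradient perturbation on the input embeddings, and hypothetical hyperparameters `p_adv`, `eps`, and `lam`. The data are synthetic stand-ins for embedding vectors.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grads(W, b, X, y):
    """Gradients of the mean cross-entropy loss w.r.t. W, b, and the inputs X."""
    d = softmax(X @ W + b)
    d[np.arange(len(y)), y] -= 1.0
    d /= len(y)
    return X.T @ d, d.sum(axis=0), d @ W.T

def train_rat(X, y, n_classes, p_adv=0.5, eps=0.05, lam=1.0, lr=0.1, steps=200, seed=0):
    """Gradient descent where an adversarial loss term is added with probability p_adv."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(steps):
        gW, gb, gX = grads(W, b, X, y)          # standard (clean) loss
        if rng.random() < p_adv:                # stochastic adversarial step
            X_adv = X + eps * np.sign(gX)       # assumed FGSM-style perturbation
            aW, ab, _ = grads(W, b, X_adv, y)   # adversarial loss on perturbed inputs
            gW, gb = gW + lam * aW, gb + lam * ab
        W -= lr * gW
        b -= lr * gb
    return W, b

# Synthetic two-class "embeddings" (illustrative only).
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-1.0, 0.5, (40, 8)), data_rng.normal(1.0, 0.5, (40, 8))])
y = np.array([0] * 40 + [1] * 40)
W, b = train_rat(X, y, n_classes=2)
accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Skipping the adversarial pass on a random subset of steps is what keeps the extra cost below that of always-on adversarial training: each skipped step saves one additional gradient computation.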

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The problem of category imbalance in occupational accident reports: a case study of the OSHA dataset.
  • Figure 2: The overall architecture of our proposed ABEX-RAT framework. The framework consists of three stages. Stage 1 (ABEX Data Augmentation): A large language model (e.g., Qwen3) generates a concise abstract from each raw text, which is then expanded into multiple augmented samples by a BART-based model. Stage 2 (Feature Extraction): A pre-trained embedding model converts all texts into dense semantic vectors. Stage 3 (RAT & Classification): A lightweight MLP classifier is trained on these vectors using a total loss that stochastically combines a standard loss with an adversarial loss to improve model robustness.
  • Figure 3: The specific prompt design used in the ABEX abstraction phase. The placeholders {category}, {keywords}, and {full_text} are dynamically filled during inference.
  • Figure 4: Results of the ablation experiment.
  • Figure 5: Normalized confusion matrix.
  • ...and 1 more figure
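The Figure 3 caption names three placeholders that are filled at inference time. A minimal sketch of that filling step is shown here; only the placeholder names `{category}`, `{keywords}`, and `{full_text}` come from the caption, while the template wording, the `build_prompt` helper, and the sample report are hypothetical.

```python
# Hypothetical template wording; the actual prompt text appears in the paper's Figure 3.
ABSTRACTION_PROMPT = (
    "You are summarizing an occupational accident report.\n"
    "Category: {category}\n"
    "Keywords: {keywords}\n"
    "Report: {full_text}\n"
    "Write a concise abstract that keeps the details relevant to the category."
)

def build_prompt(category, keywords, full_text):
    """Fill the three placeholders for a single report."""
    return ABSTRACTION_PROMPT.format(
        category=category,
        keywords=", ".join(keywords),
        full_text=full_text.strip(),
    )

# Made-up example input, not taken from the OSHA dataset.
prompt = build_prompt(
    "Falls",
    ["ladder", "roof"],
    "  Employee fell from a ladder while repairing a roof.  ",
)
print(prompt)
```

Conditioning the prompt on the label and keywords is one plausible way to keep the generated abstract focused on label-critical content, which the abstract states is the goal of the abstraction stage.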