Efficient argument classification with compact language models and ChatGPT-4 refinements
Marcin Pietron, Rafał Olszowski, Jakub Gomułka
TL;DR
The paper tackles argument classification within argument mining by proposing a hybrid pipeline that combines compact transformer models (e.g., DistilBERT and BERT) with refinements from ChatGPT-4. The evaluation spans three datasets (US2016, UKP, and Args.me) and employs an uncertainty-based routing threshold, denoted by $\gamma$, to delegate a portion of uncertain cases to the large language model: $\gamma \approx 0.20$ for Args.me and $\gamma \approx 0.25$ for US2016. Results show that the BERT+ChatGPT-4 ensemble outperforms compact baselines and LSTM models, with the largest gains on Args.me (Top1/F1 up to 91.41/92.92) and notable improvements on US2016 (F1 ≈ 72.5%) and UKP (F1 ≈ 68.5%). The findings suggest a practical path to efficient argument classification by pairing fast compact language models (CLMs) with selective LLM refinement, and point to future work on open-source LLMs and advanced prompting strategies.
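The routing idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the uncertainty score (one minus the top class probability), the function names, and the toy stand-in models are all assumptions made for illustration, and the paper may define $\gamma$ differently.

```python
from typing import Callable, Dict


def uncertainty(probs: Dict[str, float]) -> float:
    """Assumed uncertainty score: one minus the highest class probability."""
    return 1.0 - max(probs.values())


def classify_with_routing(
    text: str,
    compact_model: Callable[[str], Dict[str, float]],
    llm_refiner: Callable[[str], str],
    gamma: float,
) -> str:
    """Keep the compact model's label unless its uncertainty exceeds gamma."""
    probs = compact_model(text)
    if uncertainty(probs) > gamma:
        # Uncertain case: delegate to the large language model (hypothetical hook).
        return llm_refiner(text)
    # Confident case: return the compact model's top label.
    return max(probs, key=probs.get)


# Toy stand-ins (hypothetical) so the sketch runs without any external service.
def toy_compact_model(text: str) -> Dict[str, float]:
    if "because" in text:
        return {"support": 0.9, "attack": 0.1}
    return {"support": 0.55, "attack": 0.45}


def toy_llm_refiner(text: str) -> str:
    return "attack"


confident = classify_with_routing(
    "Taxes should fall because growth matters.", toy_compact_model, toy_llm_refiner, 0.20
)
uncertain = classify_with_routing(
    "That plan is a disaster.", toy_compact_model, toy_llm_refiner, 0.20
)
```

Under this assumed score, a larger $\gamma$ sends fewer inputs to the LLM, so the threshold trades refinement coverage against inference cost.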
Abstract
Argument mining (AM) is the task of automatically identifying and extracting argumentative components (e.g., premises and claims) and detecting the relations among them (i.e., support, attack, or no relation). Deep learning models enable us to analyze arguments more efficiently than traditional methods and to extract their semantics. This paper presents a comparative study of several deep learning-based models for argument mining, concentrating on argument classification. The experiments cover three datasets (Args.me, UKP, US2016). The main novelty of this paper is an ensemble model based on the BERT architecture, with ChatGPT-4 as a fine-tuning model. The presented results show that BERT+ChatGPT-4 outperforms the rest of the models, including other Transformer-based and LSTM-based models. The observed improvement is, in most cases, greater than 10%. The presented analysis can provide crucial insights into how models for argument classification should be further improved. Additionally, it can help develop a prompt-based algorithm to eliminate argument classification errors.
