Data Augmentation to Improve Large Language Models in Food Hazard and Product Detection
Areeg Fahad Rasheed, M. Zarkoosh, Shimam Amer Chasib, Safa F. Abbas
TL;DR
The paper addresses text-based food-hazard and product-category classification under data imbalance. It uses ChatGPT-4o-mini to augment the training set and fine-tunes two LLMs, RoBERTa-base and Flan-T5-base, on the hazard-category and product-category tasks, reporting improvements in recall, precision, F1, and accuracy. Augmentation benefits both models: Flan-T5, the larger model, generally achieves higher macro-F1 and gains more from the additional data, while RoBERTa trains more efficiently and remains the resource-friendly option. The work offers practical guidance on when to prefer RoBERTa with augmentation over Flan-T5 with augmentation, and highlights the method's potential to improve safety-related text classification at scale.
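To see why macro-F1 is the more informative number under class imbalance, the following minimal sketch computes accuracy and macro-F1 in plain Python on a toy example. The labels `"common"` and `"rare"` and the predictions are made up for illustration and are not taken from the paper's dataset; the function names are likewise hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (assumption): 9 examples of a frequent class, 1 of a rare class,
# and a classifier that never predicts the rare class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 9 of 10 correct
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # mean of 18/19 and 0
```

In this toy case accuracy stays high even though the rare class is never recovered, whereas macro-F1 drops because each class contributes equally to the average.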
Abstract
The primary objective of this study is to demonstrate the impact of data augmentation with ChatGPT-4o-mini on food hazard and product analysis. The augmented data is used to train two large language models, RoBERTa-base and Flan-T5-base, which are then evaluated on test sets. The results indicate that training with augmented data improved model performance across key metrics, including recall, F1 score, precision, and accuracy, compared with training on the provided dataset alone. The full code, including model training and the augmented dataset, is available in this repository: https://github.com/AREEG94FAHAD/food-hazard-prdouct-cls
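The abstract does not state how many synthetic examples were generated per class. As a purely illustrative sketch, one way to plan augmentation for an imbalanced label set is to top up every class that falls below a chosen minimum count. The function name `augmentation_plan`, the `target_per_class` parameter, and the label counts below are all assumptions made for this example, not details from the paper or its repository.

```python
from collections import Counter


def augmentation_plan(labels, target_per_class):
    """Return {label: number of synthetic examples needed} for under-represented classes.

    Hypothetical helper: classes already at or above the target are left out.
    """
    counts = Counter(labels)
    return {
        label: target_per_class - n
        for label, n in counts.items()
        if n < target_per_class
    }


# Made-up label distribution (assumption), chosen only to show the imbalance.
train_labels = ["class_a"] * 500 + ["class_b"] * 40 + ["class_c"] * 3

plan = augmentation_plan(train_labels, target_per_class=50)
for label, needed in sorted(plan.items()):
    print(f"{label}: generate {needed} synthetic examples")
```

A plan like this only decides how much to generate; the generation step itself (prompting a model such as ChatGPT-4o-mini) and any quality filtering are outside the scope of the sketch.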
