SemEval-2025 Task 9: The Food Hazard Detection Challenge

Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren, Juli Bakagianni

TL;DR

This paper presents SemEval 2025 Task 9, the Food Hazard Detection Challenge, which targets explainable text classification for food-incident reports across two subtasks: coarse-grained hazard/product category prediction (ST1) and fine-grained vector-level prediction (ST2). It compares encoder-only, encoder-decoder, and decoder-only transformers and demonstrates that large-language-model–generated synthetic data can effectively address long-tail distributions, with an overall emphasis on hazard accuracy via a macro $F_1$-based scoring scheme. The study analyzes participant systems (≈260 entrants, 99 submissions, 27 system descriptions), highlighting that richer input features, ensemble methods, and synthetic data contribute most to performance, while no single transformer architecture consistently dominates. The findings underscore the potential of synthetic data and model ensembles for real-world food-hazard information extraction, and they identify key avenues for future work, including vector-task difficulty, explainability, and robust debugging in imbalanced, domain-specific datasets.
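The summary above names a macro $F_1$-based scoring scheme that emphasizes hazard accuracy but does not spell out the formula. The following is a minimal sketch of one way such a score could be computed, under a stated assumption: hazard macro $F_1$ is computed on all samples, product macro $F_1$ only on samples whose hazard was predicted correctly, and the two are averaged. The function names `macro_f1` and `hazard_weighted_score` are hypothetical and chosen for illustration; this is not presented as the official evaluation script.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0  # no samples to score
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def hazard_weighted_score(hazards_true, products_true, hazards_pred, products_pred):
    """Average of hazard macro-F1 and product macro-F1.

    ASSUMPTION (not stated in the summary): product F1 is computed only on
    samples whose hazard prediction is correct, so hazard errors cost twice.
    """
    f1_hazards = macro_f1(hazards_true, hazards_pred)
    # Keep only the indices where the hazard label was predicted correctly.
    keep = [i for i, (t, p) in enumerate(zip(hazards_true, hazards_pred)) if t == p]
    f1_products = macro_f1(
        [products_true[i] for i in keep],
        [products_pred[i] for i in keep],
    )
    return (f1_hazards + f1_products) / 2


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    h_true = ["biological", "allergens", "biological"]
    h_pred = ["biological", "allergens", "biological"]
    p_true = ["meat", "dairy", "fruit"]
    p_pred = ["meat", "dairy", "meat"]
    print(hazard_weighted_score(h_true, p_true, h_pred, p_pred))
```

Under this assumption, a system that misclassifies every hazard scores zero regardless of its product predictions, which is one way to read the "emphasis on hazard accuracy" in the summary.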

Abstract

In this challenge, we explored text-based food hazard prediction with long-tail-distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: The columns in the blue boxes were available to the participants to serve as model input, while the orange boxes comprised the ground truth labels per sub-task. The number on the right of each label indicated the number of unique values per label.
  • Figure 2: Timeline of the challenge: (a) Trial Phase: Training data was provided before the challenge commenced. (b) Conception Phase: Example code, along with unlabeled validation and test data, was released at the beginning of the challenge. During this phase, participants could submit separate trial entries for ST1 (category classification) and ST2 ("vector" classification) using the validation data. (c) Evaluation Phase: The validation data was made available, and final submissions for both tasks were accepted on the test data to determine the final ranking.
  • Figure 3: Overview of the data used in the challenge
  • Figure 4: Frequency distribution of system attributes. Each subplot represents a distinct attribute, illustrating the choices made by the participating systems in terms of features, task treatment, model types, ensemble strategies, model availability, and data usage.
  • Figure 5: Average score achieved and number of submissions per combination of input features used (-- ST1, -- ST2). The horizontal bars show minimum and maximum score and the number of samples is annotated as $n$.
  • ...and 5 more figures