Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, Barbara Plank
TL;DR
This work introduces Donkii, the first benchmark for applying Annotation Error Detection (AED) to instruction-tuning data, and presents a taxonomy of the error types that affect instruction-tuning datasets. It proposes four training-dynamics-based AED baselines adapted for generative outputs and evaluates them across the three Donkii datasets, showing that choosing the right method and model size is crucial for effective data cleaning. The study demonstrates substantial improvements over random baselines, notes dataset- and category-specific performance trends, and provides practical guidance on using AED to improve the quality of instruction-following data. Overall, Donkii enables systematic evaluation of data-quality interventions in instruction-tuned LLM pipelines and highlights their implications for downstream model quality and reliability.
Abstract
Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present the first benchmark for AED on instruction-tuning data: DONKII. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the newly introduced benchmark. Our results show that choosing the right AED method and model size is indeed crucial, and we derive practical recommendations for how to use AED methods to clean instruction-tuning data.
