Analysis of Systems' Performance in Natural Language Processing Competitions
Sergio Nava-Muñoz, Mario Graff, Hugo Jair Escalante
TL;DR
The paper addresses the challenge of robustly comparing participants in NLP competitions when only prediction data are available and datasets are fixed. It proposes a bootstrap-based evaluation framework that computes confidence intervals for individual performances and for differences relative to the winner, complemented by significance testing and multiple-comparisons corrections. The methodology is demonstrated across eight NLP competitions, providing metrics for competition difficulty (PPI) and competitiveness (ties, CV, win-med gap) and enabling off-the-shelf analysis via the CompStats tool. The approach offers a practical, statistically sound foundation for fair winner determination and cross-task comparison, with broad applicability to classification and regression tasks in collaborative challenges.
Abstract
Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by the organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing both competition results and the competitions themselves. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language processing competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.
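As an illustration of the bootstrap-based comparison summarized above, the following is a minimal sketch, not the authors' implementation and not the CompStats API. It assumes that only the gold labels and each system's predictions are available, that accuracy is the evaluation score, and that percentile intervals are used; the names `bootstrap_compare`, `accuracy`, and the report keys are hypothetical. Significance testing and multiple-comparisons corrections are not covered here.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of matching labels (assumed score; any metric could be swapped in)."""
    return float(np.mean(y_true == y_pred))


def bootstrap_compare(y_true, predictions, metric=accuracy,
                      n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs per system and for the gap to the best system.

    predictions: dict mapping a system name to its predictions on the test set.
    Returns (best_name, report), where report[name] holds the point score,
    its CI, and the CI of (best score - system score).
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    preds = {name: np.asarray(p) for name, p in predictions.items()}
    n = len(y_true)

    # Draw the resampled indices once and reuse them for every system,
    # so that the differences between systems are paired.
    idx = rng.integers(0, n, size=(n_boot, n))
    samples = {
        name: np.array([metric(y_true[rows], p[rows]) for rows in idx])
        for name, p in preds.items()
    }

    # The "winner" is the system with the best score on the full test set.
    point = {name: metric(y_true, p) for name, p in preds.items()}
    best = max(point, key=point.get)

    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    report = {}
    for name, s in samples.items():
        diff = samples[best] - s  # paired difference relative to the winner
        report[name] = {
            "score": point[name],
            "ci": tuple(np.percentile(s, [lo, hi])),
            "diff_ci": tuple(np.percentile(diff, [lo, hi])),
        }
    return best, report
```

Under these assumptions, a system whose `diff_ci` includes zero cannot be distinguished from the winner at the chosen level, which is one way to read the notion of ties mentioned in the TL;DR.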
