Jury: A Comprehensive Evaluation Toolkit

Devrim Cavusoglu; Secil Sen; Ulas Sert; Sinan Altinuc

TL;DR

The paper addresses the fragmentation of NLP evaluation across diverse tasks and metrics by introducing Jury, an evaluation toolkit built on top of the evaluate library. Jury provides a unified metric interface, supports concurrent computation, and handles evaluation with multiple predictions and multiple references, along with task-aware metric mappings. The key contributions are a unified Metric-Jury design, multi-metric evaluation in a single run, and compatibility with existing metrics via evaluate, all built on an Arrow-table-based computation framework. The work aims to standardize and speed up NLP evaluation, improving reproducibility and efficiency for researchers and practitioners.

Abstract

Evaluation plays a critical role in deep learning as a fundamental block of any prediction-based system. However, the vast number of Natural Language Processing (NLP) tasks and the development of various metrics have led to challenges in evaluating different systems with different metrics. To address these challenges, we introduce jury, a toolkit that provides a unified evaluation framework with standardized structures for performing evaluation across different tasks and metrics. The objective of jury is to standardize and improve metric evaluation for all systems and aid the community in overcoming the challenges in evaluation. Since its open-source release, jury has reached a wide audience and is available at https://github.com/obss/jury.

Paper Structure (13 sections, 1 equation, 5 figures, 1 table)

Introduction
Related Work
Jury/Library Overview
Design & Structures
Metric
Unified Interface
Task Assigned Metrics
Scorer
Library Tour
Discussion
Conclusion
Computation Schema of Metrics
Experiments on Runtime

Figures (5)

Figure 1: General computation scheme of jury. (a) By default, both reduce operations are max; optionally, each can be set separately. (b) For most metrics, the reduce operation is the arithmetic mean.
Figure 2: Runtime experiments against other open-source frameworks: (a) compares how well the tools scale with input size, and (b) compares how well they scale with the number of metrics. Performance is measured as throughput (items/s), plotted on a $\log_2$ scale in both plots. Reported results are averages over 5 runs.
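
The computation scheme in the Figure 1 caption can be illustrated with a minimal, self-contained sketch. It assumes one reading of that caption: each prediction is scored against every reference and reduced with max, the per-prediction scores are reduced with max again, and the resulting per-item scores are averaged. The names token_f1 and score_corpus are hypothetical and are not part of jury's API; the toy token-overlap metric only stands in for a real metric.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Toy stand-in metric (hypothetical): F1 over unique lowercase tokens."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(pred_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_corpus(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
    metric: Callable[[str, str], float] = token_f1,
    reduce_refs: Callable[[Iterable[float]], float] = max,
    reduce_preds: Callable[[Iterable[float]], float] = max,
    aggregate: Callable[[Iterable[float]], float] = mean,
) -> float:
    """Score items that each have several predictions and several references.

    Assumed scheme: reduce over references, then over predictions (both max
    by default), then aggregate the per-item scores (arithmetic mean).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    item_scores = []
    for preds, refs in zip(predictions, references):
        # One score per prediction: best match among this item's references.
        per_prediction = [reduce_refs([metric(p, r) for r in refs]) for p in preds]
        # One score per item: best prediction for this item.
        item_scores.append(reduce_preds(per_prediction))
    return aggregate(item_scores)


if __name__ == "__main__":
    predictions = [["the cat sat", "a dog"], ["hello world"]]
    references = [["the cat sat", "the cat"], ["hello there", "goodbye"]]
    print(score_corpus(predictions, references))  # 0.75
```

Keeping the two reduce functions as separate parameters mirrors the caption's note that they can be set independently; for example, passing reduce_preds=mean would average over an item's predictions instead of keeping only the best one.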
