When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP

Sara Papi, Marco Gaido, Andrea Pilzer, Matteo Negri

TL;DR

Reproducibility alone does not ensure correctness in NLP research software. The authors present a Conformer case study showing that multiple open-source implementations contain padding-related bugs, yet can still yield good and reproducible results that mislead conclusions. They demonstrate that building on incorrect code can artificially inflate perceived gains when evaluating methods such as CTC compression. To improve trust, they release a corrected Conformer, introduce pangoliNN for neural-network testing, and propose a Code-quality Checklist to promote software quality in the NLP community.

Abstract

Despite its crucial role in research experiments, code correctness is often presumed only on the basis of the perceived quality of results. This assumption comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on reproducibility should go hand in hand with the emphasis on software quality. We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which however can lead to incorrect conclusions that potentially misguide future research. As a countermeasure, we propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models, with the goal of promoting coding best practices and improving research software quality within the NLP community.
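To make the kind of padding-related defect described above concrete, the following is a minimal sketch of a padding-invariance check. It is not the authors' code and does not use the pangoliNN API; the toy `layer`, the helper `is_padding_invariant`, and all parameter values are hypothetical and chosen only for illustration. The idea is that the output for the valid frames of a sequence should not change when the same sequence is padded, as happens when it is batched with longer ones.

```python
import numpy as np

def layer(x, mask, kernel, bias, mask_padding):
    """Toy layer (hypothetical): add a bias, then convolve over time."""
    h = x + bias  # padded frames become non-zero after the bias
    if mask_padding:
        h = h * mask  # zero out padded frames before the convolution
    # 'same' keeps the time dimension and implicitly zero-pads the borders
    return np.convolve(h, kernel, mode="same")

def is_padding_invariant(mask_padding, n_valid=5, n_pad=3, seed=0):
    """Compare the output on an unpadded sequence with the output on
    the same sequence followed by n_pad padding frames."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_valid)
    kernel = np.array([0.25, 0.5, 0.25])
    bias = 1.0

    # Reference: the sequence processed on its own, without padding
    reference = layer(x, np.ones(n_valid), kernel, bias, mask_padding)

    # Same sequence with trailing padding and the matching validity mask
    x_padded = np.concatenate([x, np.zeros(n_pad)])
    mask = np.concatenate([np.ones(n_valid), np.zeros(n_pad)])
    padded_out = layer(x_padded, mask, kernel, bias, mask_padding)

    # Only the valid frames are compared; padded outputs are ignored
    return bool(np.allclose(reference, padded_out[:n_valid]))

if __name__ == "__main__":
    print("without masking:", is_padding_invariant(mask_padding=False))
    print("with masking:   ", is_padding_invariant(mask_padding=True))
```

In this toy setting the unmasked variant fails the check because the last valid frame picks up the non-zero value of the neighbouring padded frame, yet the layer still runs without error and gives the same output every time, which is why such defects can go unnoticed when only results are inspected.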

Paper Structure (22 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Example of the relative shift operation starting from a relative PE matrix containing padding (1), shown for a codebase with (2) and without (3) the beetle3 bug. The first row is always discarded.
  • Figure 2: Convolution module in the Conformer encoder layer. Convolutional blocks are 1D convolutions.