
Testing Research Software: An In-Depth Survey of Practices, Methods, and Tools

Nasir U. Eisty, Upulee Kanewala, Jeffrey C. Carver

TL;DR

The paper presents an in-depth survey of testing practices in research software, revealing wide variation in testing processes, limited use of standard metrics, and a strong influence of domain, team size, and dedicated testing resources on testing approaches. It highlights core challenges such as test case design and the oracle problem, and demonstrates a gap between industrial testing tools and the specialized needs of researchers. The findings justify investment in targeted training, resource allocation, and the development of domain-aware testing tools to improve reliability, reproducibility, and efficiency in scientific software. Overall, the work calls for a cultural and tool-building shift to integrate testing more deeply into research software development and to support the evolving, complex nature of scientific computations.

Abstract

Context: Research software is essential for developing advanced tools and models to solve complex research problems and drive innovation across domains. Therefore, it is essential to ensure its correctness. Software testing plays a vital role in this task. However, testing research software is challenging due to the software's complexity and to the unique culture of the research software community. Aims: Building on previous research, this study provides an in-depth investigation of testing practices in research software, focusing on test case design, challenges with expected outputs, use of quality metrics, execution methods, tools, and desired tool features. Additionally, we explore whether demographic factors influence testing processes. Method: We survey research software developers to understand how they design test cases, handle output challenges, use metrics, execute tests, and select tools. Results: Research software testing varies widely. The primary challenges are test case design, evaluating test quality, and evaluating the correctness of test outputs. Overall, research software developers are not familiar with existing testing tools and need new tools that support their specific requirements. Conclusion: Allocating human resources to testing and providing developers with knowledge about effective testing techniques are important steps toward improving the testing process of research software. While many industrial testing tools exist, they are inadequate for testing research software due to its complexity, specialized algorithms, continuous updates, and need for flexible, custom testing approaches. Access to a standard set of testing tools that address these special characteristics will increase the level of testing in research software development and reduce the overhead of distributing knowledge about software testing.

Paper Structure

This paper contains 61 sections, 9 figures, 46 tables.

Figures (9)

  • Figure 1: Survey questions - Part I
  • Figure 2: Survey questions - Part II
  • Figure 3: Whether the testing process is ad-hoc or systematic: 1 - ad-hoc, 7 - systematic (Q11)
  • Figure 4: Whether the testing process is manual or automated: 1 - least automated, 7 - most automated (Q12)
  • Figure 5: Ranking of testing challenges. Numbers on the right indicate the rank respondents gave each testing task according to how challenging it is: 1 - most challenging, 7 - least challenging (Q15)
  • ...and 4 more figures