Towards Using Multiple Iterated, Reproduced, and Replicated Experiments with Robots (MIRRER) for Evaluation and Benchmarking

Adam Norton; Brian Flynn

TL;DR

The paper addresses the absence of standardized evaluation of generalizability and reproducibility in robotics. It introduces MIRRER, a conceptual framework that unifies performance evaluation, benchmarking, and reproduced/replicated experimentation to enable cross-lab comparison of robot capabilities. It provides formal definitions for context, the component under evaluation (CUE), generalizability, and the three experimental modalities (iterated, reproduced, replicated), and discusses local relative ranking (LRR) as a viable approach to benchmarking across labs. The discussion highlights gaps in infrastructure and incentives that must be addressed to operationalize MIRRER: context capture standards, modular hardware/software pipelines (e.g., GRASPA), data sharing, and publication incentives. Initial lab experiments and future multi-lab collaborations are planned. Overall, MIRRER aims to make robotics research more reliable and comparable by formalizing evaluation protocols and encouraging reproducible benchmarking practices.

Abstract

The robotics research field lacks formalized definitions and frameworks for evaluating advanced capabilities, including generalizability (the ability of robots to perform tasks under varied contexts) and reproducibility (the performance of a reproduced robot capability in different labs under the same experimental conditions). This paper presents an initial conceptual framework, MIRRER, that unites the concepts of performance evaluation, benchmarking, and reproduced/replicated experimentation in order to facilitate comparable robotics research. Several open issues with the application of the framework are also presented.
