Accelerating scientific discovery with the common task framework

J. Nathan Kutz, Peter Battaglia, Michael Brenner, Kevin Carlberg, Aric Hagberg, Shirley Ho, Stephan Hoyer, Henning Lange, Hod Lipson, Michael W. Mahoney, Frank Noe, Max Welling, Laure Zanna, Francis Zhu, Steven L. Brunton

TL;DR

The paper proposes the Common Task Framework (CTF) to provide fair, objective benchmarking of ML/AI methods on dynamic systems with withheld test sets and multi-metric evaluation. It combines a permanent collection of toy dynamical models with rotating real-world datasets, structured around 12 challenge matrices $X_J$ and corresponding scores (e.g., $E_1$–$E_{12}$) to evaluate forecasting, reconstruction, limited-data performance, and parametric generalization. It situates CTF within a broader discussion of inductive versus deductive reasoning, advocating for physics-informed constraints and transparent evaluation workflows, including a Sage Bionetworks referee and GitHub-based reproducibility. The framework is designed to be laptop-accessible, scalable, and adaptable to diverse scientific domains, aiming to accelerate responsible, cross-disciplinary progress in data-driven science and engineering.

Abstract

Machine learning (ML) and artificial intelligence (AI) algorithms are transforming and empowering the characterization and control of dynamic systems in the engineering, physical, and biological sciences. These emerging modeling paradigms require comparative metrics to evaluate a diverse set of scientific objectives, including forecasting, state reconstruction, generalization, and control, while also considering limited data scenarios and noisy measurements. We introduce a common task framework (CTF) for science and engineering, which features a growing collection of challenge data sets with a diverse set of practical and common objectives. The CTF is a critically enabling technology that has contributed to the rapid advance of ML/AI algorithms in traditional applications such as speech recognition, language processing, and computer vision. There is a critical need for the objective metrics of a CTF to compare the diverse algorithms being rapidly developed and deployed in practice today across science and engineering.

Paper Structure

This paper contains 4 sections, 9 equations, 6 figures.

Figures (6)

  • Figure 1: Illustration of classic CTF benchmark problems and modern CTF benchmark problems in dynamical systems. There is a progression of complexity, from static data, to interactive data where it is possible to run control experiments, to living cyberphysical systems. Top row: (left) ImageNet dataset [deng2009imagenet]; (middle) Atari video game environment in Gymnasium [ravichandiran2018hands, dutta2018reinforcement]; (right) pendulum on a cart used for the PILCO learner [deisenroth2011pilco]. Bottom row: (left) fluid flow past a cylinder and lava lamp; (middle) HydroGym cavity flow environment [HydroGym]; (right) double pendulum [kaheman2023experimental] and Quanser Aero 2 platform [traver2022digital].
  • Figure 2: Distribution of errors from fitting the data with 1000 random initializations of the neural network. Note that for this application, where the test set is in an extrapolation regime, the errors are widely spread, ranging from a minimum RMSE of approximately 29 to 185, with an average of about 67. This highlights an important aspect of neural network training: performance on a truly withheld test set can have very high variance.
  • Figure 3: Environments in the AI Institute CTF for dynamical systems permanent collection. Included in the permanent challenge sets are three dynamical systems (Lorenz, Rössler, double pendulum) and three spatio-temporal systems (Kuramoto-Sivashinsky, Lorenz96, Burgers). The user is provided with 10 training data sets, with the requirement of generating 9 test set approximations. A diverse set of metrics is measured, including those related to forecasting and reconstruction with noisy measurements and limited data.
  • Figure 4: (a) From the 2nd century AD until the beginning of the scientific revolution (circa 1600), the deductive Ptolemaic system of the solar system was used to describe the motion of the planets as circles on circles. One can consider this as the precursor to the Fourier transform. (b) The inevitable push to a heliocentric coordinate system by Copernicus, Kepler and Galileo allowed Newton to develop an inductive theory of the solar system: ${\bf F}=m{\bf a}$. In modern data-driven science, deep learning can be framed towards (c) inductive models using autoencoder strategies to reduce dynamics to their intrinsic dimensions, or (d) deductive models which project to higher dimensional spaces and aim for accuracy and flexibility. Modern autonomy leverages sensors to build data-driven models: physics-based inductive models for robotics (e) and non-physics-based, deductive statistical models for self-driving cars.
  • Figure 5: Evaluation of forecasting and reconstruction capabilities of a method with numerically accurate data for both (a) short- and (b) long-time forecasting, (c) with limited data, (d) with and without noise, and (e) under parametric variability.
  • ...and 1 more figure
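
The seed-to-seed spread described in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's experiment: the toy signal, the train/test intervals, the network size, and the helper names `rmse` and `fit_and_score` are all assumptions made for illustration, and a random-feature network with a least-squares readout stands in for a fully trained neural network. The only point it aims to show is that RMSE on a withheld extrapolation interval can vary considerably with the random initialization alone.

```python
import numpy as np

def rmse(pred, truth):
    """Root-mean-square error between two equally shaped arrays."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def fit_and_score(seed, n_hidden=20):
    """Fit one randomly initialized model; return its RMSE on the withheld test set."""
    rng = np.random.default_rng(seed)

    # Hypothetical toy signal (assumption): train on t in [0, 4],
    # withhold t in [4, 6] so the test set lies in an extrapolation regime.
    t_train = np.linspace(0.0, 4.0, 200)
    t_test = np.linspace(4.0, 6.0, 100)

    def signal(t):
        return np.sin(2.0 * t) + 0.5 * t

    # Random hidden layer: this is the seed-dependent "initialization".
    w = rng.normal(size=n_hidden)
    b = rng.normal(size=n_hidden)

    def features(t):
        return np.tanh(np.outer(t, w) + b)

    # Linear readout fitted by least squares on the training data only.
    coef, *_ = np.linalg.lstsq(features(t_train), signal(t_train), rcond=None)

    # Score against data the fit never saw.
    return rmse(features(t_test) @ coef, signal(t_test))

# Repeat the fit over many seeds and summarize the spread of test errors.
errors = np.array([fit_and_score(seed) for seed in range(200)])
print(f"min={errors.min():.3g} mean={errors.mean():.3g} max={errors.max():.3g}")
```

Reporting the minimum, mean, and maximum over seeds, rather than the single best run, is one way to make this kind of variance visible when methods are compared on a withheld test set.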