Decoding complexity: how machine learning is redefining scientific discovery

Ricardo Vinuesa; Paola Cinnella; Jean Rabault; Hossein Azizpour; Stefan Bauer; Bingni W. Brunton; Arne Elofsson; Elias Jarlebring; Hedvig Kjellstrom; Stefano Markidis; David Marlevi; Javier Garcia-Martinez; Steven L. Brunton

TL;DR

The authors argue that machine learning lets the scientific community move beyond the oversimplifications that traditional approaches made necessary and embrace the full complexity of natural systems, ultimately paving the way for interdisciplinary breakthroughs and innovative solutions to humanity's most pressing challenges.

Abstract

As modern scientific instruments generate vast amounts of data and the volume of information in the scientific literature continues to grow, machine learning (ML) has become an essential tool for organising, analysing, and interpreting these complex datasets. This paper explores the transformative role of ML in accelerating breakthroughs across a range of scientific disciplines. By presenting key examples, such as brain mapping and exoplanet detection, we demonstrate how ML is reshaping scientific research. We also examine scenarios in which different levels of knowledge of the underlying phenomenon are available, identifying strategies to overcome limitations and unlock the full potential of ML. Despite these advances, the growing reliance on ML poses challenges for research applications and for the rigorous validation of discoveries. We argue that even with these challenges, ML is poised to disrupt traditional methodologies and advance the boundaries of knowledge by enabling researchers to tackle increasingly complex problems. Thus, the scientific community can move beyond the oversimplifications that traditional approaches made necessary and embrace the full complexity of natural systems, ultimately paving the way for interdisciplinary breakthroughs and innovative solutions to humanity's most pressing challenges.

Paper Structure (12 sections, 5 figures, 1 table)

This paper contains 12 sections, 5 figures, 1 table.

Introduction
Embracing complexity
Discovery versus re-discovery by machine learning
Machine-learning-driven scientific discovery when complete information is available
Machine-learning-driven scientific discovery when only partial information is available
Machine-learning-driven scientific discovery when little information is available
The drawbacks, limitations and challenges of machine learning for scientific discovery
Conclusions and outlook
Acknowledgements
Author contributions
Competing interests
Publisher’s note

Figures (5)

Figure 1: Schematic representation of the various applications of ML for scientific discovery, depending on the amount of knowledge available in each category. A number of examples are provided, including brain research, drug discovery, dark matter and fluid mechanics.
Figure 2: Visual summary of the paper's main ideas. (Top left) The growth over time in the amount of data generated by scientific instrumentation, and the shift from data organization by humans, to computers, and finally to machine-learning techniques, which leaves humans with less direct observation of, intervention in and understanding of scientific discoveries. (Top right) The four challenges identified in the paper for the use of ML techniques in scientific discovery: data quality and availability, potential biases, explainability and overfitting. (Bottom left) What constitutes a scientific discovery, and whether ML can make original breakthroughs rather than simply rediscover known ideas, concepts or laws. (Bottom right) The need for more and better data to make scientific discoveries with ML as prior knowledge of the subject of study decreases.
Figure 3: Schematic representation of ML directions to enable scientific discoveries when complete information about the governing equations is available. In such a case, supervised, unsupervised, and reinforcement-learning methodologies can all be used. Supervised and unsupervised methodologies are made possible by generating large datasets of synthetic data simulated from the governing equations. This allows the deployment of a variety of ML techniques that can discover complex hidden relations, nonlinear coordinate systems or hidden dynamics, or solve otherwise intractable problems. Reinforcement learning can also be used by coupling it to the physics simulator, an approach that has already proven successful at discovering previously unknown control strategies and regimes of complex systems, or at generating high-quality heuristic guesses that can be tested in problems where verifying a solution is easy but suggesting good candidate solutions is hard.
Figure 4: Example of machine learning applied to a case where partial knowledge is available about the underlying system, illustrating a model (for instance a flow with complex rheology or a flow through a porous medium, top right of the picture) which depends on a set of known inputs $\mathbf{x}$ (e.g. geometry, boundary conditions, etc.) as well as on a set of hidden (unobservable) variables $\boldsymbol{\alpha}$ describing, e.g., the fluid constitutive behavior. The latter may involve small-scale phenomena that can be difficult or impossible to describe. In such conditions, experimental or numerical data for observable quantities (e.g. velocity fields or stresses $\mathbf{y}$) can be used to infer the unknown field by training a machine learning model (here represented as a neural network, although other ML approaches are possible), subject to physical constraints (e.g. positivity, symmetries or invariances). The whole process makes it possible, on the one hand, to train a data-driven closure model for the hidden variables $\boldsymbol{\alpha}$ and, on the other hand, to gain a posteriori physical knowledge of the fluid constitutive properties.
Figure 5: Schematic representation of a model (for instance, the observed symptoms of an unknown or complex disease within a population, or observed opinion dynamics within a social network) where the behavior as observed in data depends on an unknown dynamic or causal structure. The observed behavior or dynamics might occur on several different spatial and temporal scales, and the observed data might reflect more or fewer aspects of the underlying system. In such conditions, representation-learning methods can be employed to distil out an explanation of the observed data in the form of a system of ODEs or as a causal-graph representation.
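
The idea in the Figure 5 caption, distilling observed data into a system of ODEs, can be illustrated with a small sketch. The example below is not the authors' method: it swaps in a simple sparse-regression approach (sequentially thresholded least squares over a library of candidate terms) and applies it to noise-free synthetic data from a known decay law, dx/dt = -2x. The function names `simulate_decay` and `fit_sparse_ode`, the candidate library, and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def simulate_decay(rate=2.0, x0=1.0, dt=0.01, steps=500):
    """Synthetic 'observations' of dx/dt = -rate * x (hypothetical test system)."""
    t = np.arange(steps) * dt
    return t, x0 * np.exp(-rate * t)


def fit_sparse_ode(x, dt, threshold=0.1, iterations=10):
    """Fit dx/dt as a sparse combination of the candidate terms [1, x, x^2, x^3].

    Sequentially thresholded least squares: fit all terms, zero out the
    coefficients below `threshold`, refit the surviving terms, and repeat.
    """
    # Central finite differences; drop the two endpoints, where np.gradient
    # falls back to less accurate one-sided differences.
    dxdt = np.gradient(x, dt)[1:-1]
    xi = x[1:-1]

    library = np.column_stack([np.ones_like(xi), xi, xi**2, xi**3])
    coeffs = np.linalg.lstsq(library, dxdt, rcond=None)[0]

    for _ in range(iterations):
        small = np.abs(coeffs) < threshold
        coeffs[small] = 0.0
        active = ~small
        if not active.any():
            break
        # Refit using only the terms that survived thresholding.
        coeffs[active] = np.linalg.lstsq(library[:, active], dxdt, rcond=None)[0]

    return coeffs


t, x = simulate_decay()
coeffs = fit_sparse_ode(x, dt=t[1] - t[0])
print(dict(zip(["1", "x", "x^2", "x^3"], np.round(coeffs, 3))))
```

On this clean data the only surviving term should be the linear one, with a coefficient close to -2. Real observations are noisy and multiscale, as the caption notes, so derivative estimation and the choice of candidate terms would need far more care than this sketch suggests.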
