On the Salient Limitations of the Methods of Assembly Theory and their Classification of Molecular Biosignatures

Abicumaran Uthamacumaran; Felipe S. Abrahão; Narsis A. Kiani; Hector Zenil

TL;DR

The paper challenges Assembly Theory (AT) by arguing that its molecular assembly index (MA) is not a novel complexity measure but a dictionary-based lossless compression analogue, essentially equivalent to LZ77/LZ78-style coding. Through cross-data benchmarks on mass spectrometry signatures, InChI strings, and bond-distance matrices, MA shows strong correlations with standard compression schemes (e.g., 1D-RLE, 1D-Huffman, LZW) and fails to outperform established algorithmic-information measures such as BDM and related CTM-inspired approaches. The authors also demonstrate a deceiving-molecule phenomenon where objects with high MA can arise from simple, resource-bounded processes, leading to false positives and challenging claims that MA uniquely detects life or extraterrestrial biosignatures. They argue for adopting algorithmic-information frameworks that account for environment, modularity, and higher-order causality, rather than relying on MA/AT alone, to robustly discriminate biosignatures across data representations. Overall, the work clarifies significant limitations of MA and urges a shift toward intrinsic complexity measures rooted in algorithmic information theory for life-detection and biosignature analysis.

Abstract

We demonstrate that the assembly pathway method underlying assembly theory (AT) is an encoding scheme widely used by popular statistical compression algorithms. We show that in all cases (synthetic or natural) AT performs similarly to other simple coding schemes and underperforms compared to system-related indexes based upon algorithmic probability that take into account statistical repetitions but also the likelihood of other computable patterns. Our results imply that the assembly index does not offer substantial improvements over existing methods, including traditional statistical ones, and imply that the separation between living and non-living compounds following these methods has been reported before.
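To make the phrase "simple coding schemes" concrete, the following is a minimal Python sketch of two such schemes, run-length encoding (RLE) and LZW, both of which shorten a string by exploiting repeated material. It is an illustration only and is not the authors' benchmark code; the function names `rle_encode` and `lzw_encode` and the toy input strings are assumptions introduced here.

```python
def rle_encode(s):
    """Run-length encode a string into (symbol, run_length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            # Extend the current run of identical symbols.
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def lzw_encode(s):
    """LZW-encode a string as a list of dictionary indices.

    The dictionary starts with the distinct symbols of s (in sorted order)
    and grows by one phrase each time an unseen phrase is met.
    """
    dictionary = {ch: i for i, ch in enumerate(sorted(set(s)))}
    phrase = ""
    codes = []
    for ch in s:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate  # keep extending a known phrase
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)
            phrase = ch
    if phrase:
        codes.append(dictionary[phrase])
    return codes


if __name__ == "__main__":
    print(rle_encode("AAABBC"))
    print(lzw_encode("ABABABAB"))
```

In both cases the length of the output (number of runs, or number of emitted codes) drops as the input contains more repeated blocks, which is the sense in which such schemes "count copies".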

Paper Structure (18 sections, 5 theorems, 11 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
What a ZIP file can tell about life
MA and compression algorithms
Limitations of MA as a complexity measure
Discussion: emergence and intrinsic complexity measures
Mischaracterisations
On dictionary-based algorithms
Methods
Description of Algorithmic Complexity Measures
PyBDM Code for CTM and BDM
Expected false positives from complexity-deceiving molecules with arbitrarily high statistical significance
Mathematical framework and assumptions
Definitions and notation
Theoretical results
Empirical results
...and 3 more sections

Key Result

Lemma D.1

Let $\mathcal{S}$ be infinite computably enumerable. Let $\mathbf{F}$ be an arbitrary formal theory that contains assembly theory, including all the decidable procedures of the chosen method for calculating the assembly index (or approximating MA) of an object for a nested subspace of $\mathcal{S}$ […] where the function $c_\Gamma \colon \Gamma \subset \mathcal{S} \to \mathbb{N}$ gives the MA […]

Figures (5)

Figure 1: Classification of molecular complexity by multiple complexity indexes originally used to create the chemical space for the mass spectroscopy (MS) profiles (log scale). A strong Pearson correlation (R = 0.8823) was observed between 1D-BDM and MA for the 99 molecules available in the MS data set. LZW compression had a similarly strong Pearson correlation with MA (R = 0.8738). All correlations were statistically significant under a one-tailed test ($P < 0.0001$). All measures other than MA were applied to molecular bond-distance matrices; some of them outperform MA and mass spectra at distinguishing organic from non-organic molecules in the MS data set of the MA paper [cronin], as shown by greater separation and smaller variance among the molecular subgroups across the different complexity measures. MA does not display any particular advantage when compared against proper control experiments, and performs similarly to the simplest of the statistical algorithms applied to all the tested data representations, including molecular distance matrices (shown here for all measures but MA) and the mass spectral data provided by the authors of Assembly Theory. The MA values are taken from the plot of those authors' results, which could not be fully reproduced because of the lack of data made available in [cronin] but which we took at face value for comparison purposes.
Figure 2: Analysis of organic versus non-organic molecules from mass spectral data by multiple complexity indexes. Across the 18 extracts for which mass spectra were available, the strongest positive correlation was identified between MA and 1D-RLE coding (R = 0.9), which is one of the most basic coding schemes and among the most similar to the intended definition of MA as a measure capable of 'counting copies'. Other coding algorithms, including LZ and Huffman coding (R = 0.896), also show a strong positive correlation with MA. The compression values of 1D-RLE and 1D-Huffman coding show overlapping, nearly identical medians (horizontal line at centre) and ranges on the whisker plot. The analysis further confirms our previous findings: the similarity in performance between MA and popular statistical compression measures (whose purpose is also to count identical statistical copies) in classifying living vs. non-living samples leads us to make the case that MA is itself a compression measure.
Figure 3: ABRACADABRA tree diagrams for AT (A) and dynamic Huffman coding (B), both computable measures that are trivial to calculate. Huffman's was the first dictionary-based coding algorithm and is an optimal coding method able to characterise every statistical redundancy, including modularity, independently of the data representation of such copies [abrahao2024]. The (molecular) assembly index has been proven to be equivalent to LZ77/LZ78 [abrahao2024]. In this example, Huffman's (which is also a sequential lossless compression algorithm that traverses strings from left to right) collapses the compression tree into a 4-level tree, while MA's is a 7-level tree. No natural evidence indicates that the assembly index (or MA) corresponds better to how nature works; rather, the assembly index is identical to LZ compression [abrahao2024]. In both cases, the resulting tree of this word problem characterises the same token and is able to reconstruct it in full, without any loss of information, by exploiting redundancy (identical copies). Each tree yields a set of possible cause-and-effect chains, and no empirical evidence exists that favours the one produced by MA. Both LZ and Huffman, just as MA, converge to the same Shannon entropy rate and can be used to guide a search in chemical space.
Figure 4: Correlation plot between the 'Molecular Assembly' (MA) index (taken at face value: not recalculated, only reclassified according to a larger category space as in [zenilchem]) and other compression scores on InChI codes, as performed in [zenilchem] with the molecules in [marshall_murray_cronin_2017]. The vertical axes show the five complexity scores on a log-normalised scale for comparison purposes.
Figure 5: The same analysis, applying multiple statistical indexes to the same set but according to the categories in [cronin], shows the same separating properties. The strongest Pearson correlation was identified between 1D-BDM and the category of molecules (R = 0.828; $P < 0.0001$).
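
The ABRACADABRA example of Figure 3 can be explored with a short Python sketch. It computes code lengths for a static Huffman code (a simpler stand-in for the dynamic Huffman coding shown in the figure) and an LZ78-style phrase parsing of the same word. This is a hedged illustration under those assumptions, not the procedure used to draw the figure; the helper names `huffman_code_lengths` and `lz78_parse` are introduced here for illustration only.

```python
import heapq
from collections import Counter


def huffman_code_lengths(s):
    """Return {symbol: code length in bits} for a static Huffman code of s."""
    freq = Counter(s)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freq}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def lz78_parse(s):
    """Split s into LZ78 phrases: a previously seen phrase plus one new symbol."""
    seen = set()
    phrases = []
    phrase = ""
    for ch in s:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
            phrase = ""
    if phrase:
        phrases.append(phrase)  # trailing phrase that was already in the dictionary
    return phrases


if __name__ == "__main__":
    word = "ABRACADABRA"
    lengths = huffman_code_lengths(word)
    total_bits = sum(lengths[ch] for ch in word)
    print("Huffman code lengths:", lengths)
    print("Huffman total bits:", total_bits)
    print("LZ78 phrases:", lz78_parse(word))
```

Individual Huffman code lengths for the less frequent letters depend on how ties between equal weights are broken, but the total encoded length does not. Both outputs allow the word to be reconstructed exactly, which is the lossless, copy-exploiting behaviour the caption attributes to these schemes.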

Theorems & Definitions (10)

Lemma D.1
proof
Lemma D.2
proof
Lemma D.3
proof
Theorem D.3.1
proof
Corollary D.3.1.1
proof
