Local Compositional Complexity: How to Detect a Human-readable Message

Louis Mahon

TL;DR

This work tackles the problem of identifying meaningful, human‑readable messages in static data by proposing Local Compositional Complexity (LCC), a computable metric based on a two‑part description that separates structured (A) from unstructured (B) content. Local compositionality is defined as a locally tree‑like structure that captures the meaningful organization in data, and the LCC score is the length of the structured portion in the optimal description under MDL/MML principles. The author demonstrates the approach across discrete (text), continuous (images, audio), and cross‑domain data, showing that random or repetitive data yield low LCC while real human signals yield high LCC, with the Arecibo message serving as a compelling extraterrestrial‑signal example. The study links the framework to entropy and macrostate concepts, discusses compression implications, and argues that LCC can help distinguish meaningful content from noise, with potential applications in detecting non‑human communication. Overall, LCC provides a principled, computable measure of meaningful complexity that spans domains and can inform signal understanding in both terrestrial and potential extraterrestrial contexts.

Abstract

Data complexity is an important concept in the natural sciences and related areas, but lacks a rigorous and computable definition. In this paper, we focus on a particular sense of complexity that is high if the data is structured in a way that could serve to communicate a message. In this sense, human speech, written language, drawings, diagrams and photographs are high complexity, whereas data that is close to uniform throughout or populated by random values is low complexity. We describe a general framework for measuring data complexity based on dividing the shortest description of the data into a structured and an unstructured portion, and taking the size of the former as the complexity score. We outline an application of this framework in statistical mechanics that may allow a more objective characterisation of the macrostate and entropy of a physical system. Then, we derive a more precise and computable definition geared towards human communication, by proposing local compositionality as an appropriate specific structure. We demonstrate experimentally that this method can distinguish meaningful signals from noise or repetitive signals in auditory, visual and text domains, and could potentially help determine whether an extra-terrestrial signal contained a message.
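
The scoring rule described in the abstract (split the shortest description into a structured and an unstructured portion, then report the size of the structured one) can be illustrated with a small bookkeeping sketch. This is not the paper's algorithm: the `Description` class, the `complexity_score` function, and all bit counts below are hypothetical, chosen only to show how the three cost terms named in the figure captions (model cost, index cost, residual cost) would combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """One candidate two-part description of some data; all costs in bits."""
    name: str
    model_cost: float     # bits to state the model itself
    idx_cost: float       # bits to index into the model for the parts it explains
    residual_cost: float  # bits for whatever the model leaves unexplained

    @property
    def structured_bits(self) -> float:
        # Portion A: model cost plus index cost.
        return self.model_cost + self.idx_cost

    @property
    def total_bits(self) -> float:
        # Portion A plus portion B (the residual).
        return self.structured_bits + self.residual_cost


def complexity_score(candidates: list[Description]) -> float:
    """Return the size of portion A in the shortest candidate description."""
    best = min(candidates, key=lambda d: d.total_bits)
    return best.structured_bits


# Hypothetical numbers, made up purely to illustrate the accounting.
structured_signal = [
    Description("no model", 0.0, 0.0, 8000.0),
    Description("small model", 300.0, 1200.0, 2500.0),
    Description("huge model", 5000.0, 1500.0, 500.0),
]
random_signal = [
    Description("no model", 0.0, 0.0, 8000.0),
    Description("small model", 300.0, 1200.0, 7900.0),
]

print(complexity_score(structured_signal))  # 1500.0
print(complexity_score(random_signal))      # 0.0
```

In this toy setup the random signal has the larger total description length (8000 bits against 4000) but the smaller structured portion, because no model pays for itself. That mirrors the qualitative pattern the figure captions report for random text and random images.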

Paper Structure (26 sections, 20 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Dummy example descriptions of different physical systems, showing how high entropy and low complexity can come apart as the overall description length varies.
  • Figure 2: The description cost, in bits, for text in different natural languages, along with simplified English sentences and random text (top) and for artificially generated repetitive text (bottom). This is broken down into the cost to describe the model, the cost to index into that model to describe part of the data ('idx cost'), and the residual portion not accounted for by the model. Portion A of the description comprises the model cost and the index cost. This constitutes the LCC score and is marked in green. Random text has the highest total cost but the lowest meaningful cost.
  • Figure 3: Top: Toy example image. Middle: The MDL cluster labels for this image on the first level. On this level, two clusters are found, corresponding to blue-ish and green-ish pixels, and the meaningful portions, i.e. those pixels assigned a cluster centroid and not marked with 'x', are mostly in the bottom left and top right. Bottom: The MDL cluster labels on the second level. On this level, there are again two clusters, now corresponding to green-ish and blue-ish patches, and the meaningful portions are entirely in the bottom left and top right.
  • Figure 4: The mean description cost, in bits, over 100 randomly sampled images of various types (one example displayed below each column). The cost is broken down into that for the model ('model cost'), for indexing into that model to describe part of the data ('idx cost'), and for the residual portion not accounted for by the model ('residual cost'). Portion A of the description comprises the model cost and the index cost. This constitutes the LCC score and is marked in green. Random images have the highest total cost but the lowest meaningful cost. For readability, the displayed heights for the residual portion are reduced by a factor of 5.
  • Figure 5: The description cost, in bits, for different types of audio signals, broken down into the cost to describe the model, the cost to index into that model to describe part of the data ('idx cost'), and the residual portion not accounted for by the model. For readability, the displayed heights for the residual portion are reduced by a factor of 5. Portion A of the description comprises the model cost and the index cost. This constitutes the LCC score and is marked in green.
  • ...and 3 more figures