General Mechanism of Evolution Shared by Proteins and Words

Li-Min Wang, Hsing-Yi Lai, Sun-Ting Tsai, Chen Siang Ng, Kevin Sheng-Kai Ma, Shan-Jyun Wu, Meng-Xue Tsai, Yi-Ching Su, Daw-Wei Wang, Tzay-Ming Hong

Abstract

Complex systems, such as life and languages, are governed by principles of evolution. The analogy and comparison between biology and linguistics\cite{alphafold2, RoseTTAFold, lang_virus, cell language, faculty1, language of gene, Protein linguistics, dictionary, Grammar of pro_dom, complexity, genomics_nlp, InterPro, language modeling, Protein language modeling} provide a computational foundation for characterizing and analyzing protein sequences, human corpora, and their evolution. However, no general mathematical formula has so far been proposed to explain the origin of the quantitative hallmarks shared by life and language. Here we show several new statistical relationships shared by proteins and words, which inspire us to establish a general mechanism of evolution with explicit formulations that incorporates both old and new characteristics. We find that natural selection can be quantified by an entropic formulation of the principle of least effort, which determines the sequence variations that survive in evolution. In addition, introducing a function connection network explains the origin of power-law behavior and how changes in the environment stimulate the emergence of new proteins and words. Our results demonstrate not only the correspondence between genetics and linguistics across their different hierarchies but also new fundamental physical properties of the evolution of complex adaptive systems. We anticipate that our statistical tests can serve as quantitative criteria for examining whether a theory of sequence evolution is consistent with the regularity of real data. Meanwhile, this correspondence broadens the bridge for exchanging existing knowledge, spurs new interpretations, and opens Pandora's box to release several potentially revolutionary challenges. For example, does linguistic arbitrariness conflict with the dogma that structure determines function?

Paper Structure

This paper contains 5 sections, 24 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Schematic of the common evolution framework according to GLC and Tab. \ref{hierarchy}.
  • Figure 2: Panels (a, b, c) show the RRD plots for Human and for the novels Frog and The Hobbit. Their construction is demonstrated schematically in panels (d, e). The vertical lines $V_m$ and horizontal lines $H_n$ are introduced to make the scaling structure easier to see. See SI for more details.
  • Figure 3: Simple flowchart of the evolutionary algorithm in our mechanism of evolution. The Book denotes a sequence of proteins/words, as in Eq. (\ref{Book}). The step "Add $s_{q(t)}$ and build its FC" starts the loop and simulates the sequence variation that changes the length $q(t)$ of Book. The loop runs from $t=1$ to $t=L-q_0$, where $L$ is the final length of Book. The step "Small $\Omega({\bf A})$" uses the principle of least effort to simulate natural selection. The step "Mutation" simulates the sequence variation that does not change $q(t)$. See METHOD for details and SI for the fast algorithm.
  • Figure 4: (a) FRD and (b) RRD of our simulation exhibit scaling structure. The exponent $b$ of FRD $\rho_x$ can be varied by adjusting the mutation rate $P_{\text{mu}}$. For $b = 0.4 \sim 0.7$ the simulation behaves like life, while for $b = 0.8 \sim 1$ it behaves like language. See METHOD and SI for details, and Extended Fig. \ref{more_simulation} for parameters and other characteristics.
  • Figure 5: The concepts $f_1$, $f_2$, and $f_3$ are connected via their first syllagram in (a) or their second one in (b). A thicker line indicates a stronger FC. When words for $f_1$ and $f_2$ already exist, a new word coined for $f_3$ is more likely to adopt the syllagram of the old word that exhibits the stronger connection. Here, $f_3$ may be associated with "reflect".
  • ...and 8 more figures
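
The loop described in the caption of Figure 3 can be sketched roughly as follows. This is only a toy illustration of the control flow (grow Book by one element per step, select by a least-effort cost, then mutate with probability $P_{\text{mu}}$), not the authors' algorithm. The names `evolve_book`, `toy_cost`, `n_candidates`, and `alphabet` are hypothetical, `toy_cost` is a placeholder that does not reproduce $\Omega({\bf A})$, and the construction of the FC network is omitted entirely.

```python
import random


def toy_cost(book, candidate):
    """Placeholder cost standing in for Omega(A); NOT the paper's formula.

    Assumption for illustration only: short and already frequent tokens
    are treated as requiring less effort.
    """
    return len(candidate) / (1 + book.count(candidate))


def evolve_book(initial_book, final_length, alphabet, p_mu,
                n_candidates=5, seed=0):
    """Grow a Book (list of tokens) from length q0 to final_length L."""
    if not initial_book or any(len(tok) == 0 for tok in initial_book):
        raise ValueError("initial_book must contain non-empty tokens")
    rng = random.Random(seed)
    book = list(initial_book)
    q0 = len(book)

    # The loop runs from t = 1 to t = L - q0, adding one element per step.
    for t in range(1, final_length - q0 + 1):
        # Variation that changes the length: propose reused tokens
        # plus one freshly generated token (hypothetical proposal rule).
        candidates = [rng.choice(book) for _ in range(n_candidates - 1)]
        new_len = rng.randint(1, 4)
        candidates.append("".join(rng.choice(alphabet) for _ in range(new_len)))

        # Selection: keep the candidate with the smallest placeholder cost.
        book.append(min(candidates, key=lambda c: toy_cost(book, c)))

        # Mutation: alter one symbol of one token; the length q(t) is unchanged.
        if rng.random() < p_mu:
            i = rng.randrange(len(book))
            token = book[i]
            j = rng.randrange(len(token))
            book[i] = token[:j] + rng.choice(alphabet) + token[j + 1:]

    return book


book = evolve_book(["ab", "cd"], final_length=50, alphabet="abcd", p_mu=0.1)
```

In this sketch the mutation rate `p_mu` is the only knob corresponding to a quantity named in the captions; how it maps onto the exponent $b$ in Figure 4 depends on the actual $\Omega({\bf A})$ and FC rules given in METHOD and SI.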