Identifying Quantum Mechanical Statistics in Italian Corpora

Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Massimiliano Sassoli de Bianchi, Sandro Sozzo

TL;DR

The paper investigates whether word frequencies in human language exhibit quantum statistical patterns, extending prior findings from English to Italian texts. It develops a theoretical framework that maps words to energy levels and analyzes large Italian corpora using Bose--Einstein versus Maxwell--Boltzmann statistics, finding that Bose--Einstein statistics accurately models word distributions and reveals meaning-driven, entanglement-like correlations. The authors further show that word randomization acts like a temperature increase, reducing coherence and making classical statistics more applicable, which supports a decoherence-inspired interpretation of meaning in language. The results endorse a language-general, meaning-driven mechanism for quantum statistics in cognition, motivate a conceptuality interpretation of quantum mechanics, and point toward a quantum-thermodynamic treatment of information and language with potential cross-domain insights for physics.

Abstract

We present a theoretical and empirical investigation of the statistical behaviour of the words in a text produced by human language. To this aim, we analyse the word distribution of various Italian-language texts selected from a specific literary corpus. We first generalise a theoretical framework that we developed previously to identify 'quantum mechanical statistics' in large-size texts. Then, we show that, in all analysed texts, words distribute according to 'Bose--Einstein statistics' and show significant deviations from 'Maxwell--Boltzmann statistics'. Next, we introduce an effect of 'word randomization', under which the difference between the two statistical models is less pronounced than in the original texts. These results confirm the empirical patterns obtained in English-language texts and strongly indicate that identical words tend to 'clump together' as a consequence of their meaning, which can be explained as an effect of 'quantum entanglement' produced through a phenomenon of 'contextual updating'. Moreover, word randomization can be seen as the linguistic-conceptual equivalent of an increase in temperature, which destroys 'coherence' and makes classical statistics prevail over quantum statistics. Finally, some insights into the origin of quantum statistics in physics are provided.
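As a rough illustration of the word-to-energy-level mapping, the following minimal sketch ranks the distinct words of a text by frequency and assigns energy level $E_i = i$ to the word of rank $i$, so that the most frequent word sits at the lowest level. The tokenizer, the function names `energy_levels` and `totals`, the choice of 0 as the lowest level, and the sample sentence are all assumptions made for illustration; they are not taken from the paper.

```python
import re
from collections import Counter


def energy_levels(text, first_level=0):
    """Rank distinct words by frequency and assign energy level E_i = i.

    The most frequent word gets the lowest level. Starting at
    `first_level=0` is an assumption of this sketch.
    """
    # Naive tokenizer (assumption): lowercase runs of Unicode letters,
    # so accented Italian letters such as "à" are kept inside words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # Sort by descending count; ties are broken alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(first_level + i, word, n) for i, (word, n) in enumerate(ranked)]


def totals(levels):
    """Return (total number of words, total energy sum of E_i * N(E_i))."""
    total_words = sum(n for _, _, n in levels)
    total_energy = sum(e * n for e, _, n in levels)
    return total_words, total_energy


# Hypothetical sample sentence, used only to exercise the functions.
sample = "il cuore batte e il cuore sente e il tempo passa"
levels = energy_levels(sample)
N, E = totals(levels)
print(levels[0], N, E)
```

For the sample sentence, the word "il" occupies the lowest level with three appearances. The two totals are the quantities a distribution model would have to reproduce.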

Paper Structure

This paper contains 7 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We report the numbers of appearance $N(E_i)$ of the words in the text "Cuore" ("Heart"), ranked from lowest energy level, i.e. the most frequent word, to highest energy level, i.e. the least frequent word. The blue graph (almost entirely below the red graph, see Fig. 2) corresponds to empirical data, i.e. the collected numbers of appearance from the text, the red graph is a Bose--Einstein distribution model for the same numbers of appearance, and the green graph is a Maxwell--Boltzmann distribution model. The red and blue graphs coincide almost completely, whereas the green graph presents large deviations from the blue graph of the data. This shows that the Bose--Einstein distribution does provide a good model for the numbers of appearance, while the Maxwell--Boltzmann distribution does not.
  • Figure 2: We report the $\log / \log$ graphs of the numbers of appearance of Fig. 1, and their Bose--Einstein and Maxwell--Boltzmann distribution models. As already noted in Fig. 1, the red and blue graphs coincide almost completely, whereas the green graph presents large deviations from the blue graph of the data. This shows that the Bose--Einstein distribution does provide a good model for the numbers of appearance, while the Maxwell--Boltzmann distribution does not.
  • Figure 3: We report the energy distribution of the text "Cuore" ("Heart"). More precisely, the blue graph reports the energy $E_iN(E_i)$ of the text per energy level $E_i=i$, the red graph reports the same energy per energy level as modelled by the Bose--Einstein distribution, while the green graph reports the energy per energy level as modelled by the Maxwell--Boltzmann distribution.
  • Figure 4: We report the numbers of appearance $N(E_i)$ of the words in the randomized version of the text "Senilità" ("As a Man Grows Older"), ranked from lowest energy level, corresponding to the most frequent word, to highest energy level, corresponding to the least frequent word. The blue graph represents empirical data, i.e. the collected numbers of appearance from the randomized text, the red graph is a Bose--Einstein distribution model for the same numbers of appearance, and the green graph is a Maxwell--Boltzmann distribution model. The Bose--Einstein distribution is still a good model for the numbers of appearance, but the quality of the Maxwell--Boltzmann distribution model is now much better.
  • Figure 5: We report the $\log / \log$ graphs of the numbers of appearance of Fig. 4, and their Bose--Einstein and Maxwell--Boltzmann distribution models. As already noted in Fig. 4, the Bose--Einstein distribution is still a good model for the numbers of appearance, but the quality of the Maxwell--Boltzmann distribution model is now much better.
  • ...and 1 more figure
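
The two models compared in the captions can be sketched with the standard textbook forms of the Bose--Einstein and Maxwell--Boltzmann distributions, evaluated on energy levels $E_i = i$. The parameter names `a`, `b`, `c`, `d`, the helper `energy_per_level`, the choice of 0 as the lowest level, and the numerical values below are assumptions made for illustration; the paper's own parametrisation and fitted values are not reproduced here.

```python
import math


def bose_einstein(energy, a, b):
    """Standard Bose-Einstein form N(E) = 1 / (a * exp(E / b) - 1).

    Requires a * exp(E / b) > 1 so that the result is positive.
    """
    return 1.0 / (a * math.exp(energy / b) - 1.0)


def maxwell_boltzmann(energy, c, d):
    """Standard Maxwell-Boltzmann form N(E) = 1 / (c * exp(E / d))."""
    return 1.0 / (c * math.exp(energy / d))


def energy_per_level(model, params, n_levels):
    """Return E_i * N(E_i) for levels E_i = i, i = 0 .. n_levels - 1."""
    return [i * model(i, *params) for i in range(n_levels)]


# Hypothetical parameter values, chosen only to show the shapes; in
# practice they would be fitted to a text's word counts.
be_params = (1.01, 500.0)
mb_params = (1.01, 500.0)

be_counts = [bose_einstein(i, *be_params) for i in range(5)]
mb_counts = [maxwell_boltzmann(i, *mb_params) for i in range(5)]
be_energy = energy_per_level(bose_einstein, be_params, 5)

for i, (be, mb) in enumerate(zip(be_counts, mb_counts)):
    print(f"E_{i}: BE={be:.3f}  MB={mb:.3f}")
```

With identical parameters, the "-1" in the Bose--Einstein denominator makes its value exceed the Maxwell--Boltzmann one at every level, and the gap is largest at the lowest levels, i.e. for the most frequent words.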