Spontaneous emergence of linguistic statistical laws in images via artificial neural networks

Ping-Rui Tsai, Chi-hsiang Wang, Yu-Cheng Liao, Hong-Yue Huang, Tzay-Ming Hong

TL;DR

This work investigates whether images processed by vision-focused neural networks spontaneously develop language-like statistical structures. By treating convolutional kernels as visual words and counting highly active pixels, the authors show Zipf's law, Heaps' law, and Benford's law emerge in image-derived representations across multiple datasets and architectures, without explicit symbolic labeling. The analysis reveals that these laws are robust to various perturbations, though susceptibility varies by perturbation type and network design, with Benford's law showing notable resilience. The findings suggest that quasi-symbolic structures can arise from perceptual processing itself, offering fresh insight into symbol grounding, interpretability, and the perceptual roots of language-like organization in artificial systems.
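For reference, the three laws named above are conventionally stated as follows. The symbols ($r$ for rank, $f$ for frequency, $n$ for the number of tokens read so far, $V$ for the number of distinct types, $d$ for the leading digit, and the exponents $\alpha$, $\beta$) are standard textbook notation and are assumptions here, not notation taken from the paper.

```latex
% Zipf's law: frequency decays as a power of rank
f(r) \propto r^{-\alpha}, \qquad \alpha \approx 1 \text{ for natural-language text}

% Heaps' law: vocabulary grows sublinearly with the amount of text read
V(n) \propto n^{\beta}, \qquad 0 < \beta < 1

% Benford's law: distribution of the leading decimal digit
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d \in \{1, \dots, 9\}
```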

Abstract

As a core element of culture, images transform perception into structured representations and undergo evolution similar to natural languages. Given that visual input accounts for 60% of human sensory experience, it is natural to ask whether images follow statistical regularities similar to those in linguistic systems. Guided by symbol-grounding theory, which posits that meaningful symbols originate from perception, we treat images as vision-centric artifacts and employ pre-trained neural networks to model visual processing. By detecting kernel activations and extracting pixels, we obtain text-like units, which reveal that these image-derived representations adhere to statistical laws such as Zipf's, Heaps', and Benford's laws, analogous to linguistic data. Notably, these statistical regularities emerge spontaneously, without the need for explicit symbols or hybrid architectures. Our results indicate that connectionist networks can automatically develop structured, quasi-symbolic units through perceptual processing alone, suggesting that text- and symbol-like properties can naturally emerge from neural networks and providing a novel perspective for interpretation.
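The counting-and-fitting idea described in the abstract can be illustrated with a minimal sketch. It assumes that each feature map plays the role of one "word" and that its frequency is the number of pixels above a threshold; the function names `word_counts` and `zipf_fit`, the thresholding rule, and the synthetic data are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def word_counts(feature_maps, threshold):
    """Count pixels above `threshold` in each feature map.

    `feature_maps` has shape (num_maps, height, width). Treating each map
    as one "word" and its count as the word frequency is an assumption
    made for this sketch.
    """
    maps = np.asarray(feature_maps)
    return (maps > threshold).reshape(maps.shape[0], -1).sum(axis=1)


def zipf_fit(counts):
    """Fit log10(frequency) = intercept + slope * log10(rank).

    Returns (slope, r_squared). Zero counts are dropped because their
    logarithm is undefined.
    """
    freqs = np.sort(np.asarray(counts, dtype=float))[::-1]
    freqs = freqs[freqs > 0]
    ranks = np.arange(1, freqs.size + 1)
    x, y = np.log10(ranks), np.log10(freqs)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot


# Synthetic frequencies that follow f(r) = 1000 / r exactly.
ideal = 1000.0 / np.arange(1, 201)
slope, r2 = zipf_fit(ideal)
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

A slope near -1 with a high R-squared on the log-log plot is the usual signature of Zipf-like behavior; real activations would be noisier than this synthetic input.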

Paper Structure

This paper contains 13 sections, 10 figures.

Figures (10)

  • Figure 1: Three laws in statistical linguistics emerging in images and databases. In (a), we used a landscape photograph of Taiwan taken by Wei-Hsiung Huang (foto WH) and extracted a 224$\times$224 RGB Region of Interest (ROI). This ROI was then fed into the pre-trained CNN VGG-19, resulting in the emergence of Zipf’s, Heaps’, and Benford’s laws, shown respectively in blue, orange, and green. (b) illustrates the surface-texture characteristics of seven image databases, which we define as the experimental conditions. (c) shows the R-squared values obtained by feeding 16 images from each of the seven conditions into our nine pre-trained CNNs. The color scheme is the same as in (a). R-squared values above 0.93 suggest that the regression lines represent the data well.
  • Figure 2: Zipf’s law under different input conditions in Pre-CNNs. The legend on the upper right defines different Pre-CNNs with the preceding number representing the number of convolutional layers. (a) Zipf's distributions under different RMSE levels. (b) Average RMSE of nine Pre-CNNs across seven conditions. (c) The performance of the Pre-CNNs in (b) clusters into four groups, suggesting shared feature extraction strategies despite their differences in architecture. (d) Visual order parameters: mean, variance, skewness, and kurtosis were averaged across images for each condition. Pearson correlations with model RMSEs reveal which visual statistics each group emphasizes.
  • Figure 3: Heaps’ law under different conditions. (a) Distributions of Heaps’ law under different RMSE thresholds. (b) Performance of Heaps’ law following the original front-to-back input order. (c) Proportion of cases with RMSE $< 0.02$ across 1,000 random permutations of feature map order. (d) Same as (c) except RMSE $< 0.01$.
  • Figure 4: Word-position correlation in ResNet-18. (a) Original landscape image of Taiwan, used with authorization from Wei-Hsiung Huang (foto WH). (b) The Pearson correlation between word positions and each feature-map activation is computed, followed by segmentation with a 0.9 threshold. Correlated pixels primarily form small regions, consistent with Zipf’s law in that small regions constitute the main semantic components of the image. (c) To visualize the segmented correlated regions, four statistical order parameters are computed for each RGB channel, yielding 12 features per region. From the initial 4,800 feature maps, salient regions are selected and aggregated into 72 features, which are then clustered into 22 groups. The segmentation map shows the correspondence between these clusters and the original pixels from (b).
  • Figure 5: Performance of Benford’s Law in Pre-CNNs. (a) R-squared values of all Pre-CNN models under nine experimental conditions. (b) Layer-wise proportion of the nine leading digits, averaged over 144 image inputs across nine conditions. (c) Average layer positions of the leading digits based on the same setting as (b). (d) These positions are further grouped into early, middle, and late stages using four layer partitions.
  • ...and 5 more figures
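The leading-digit proportions mentioned in the Figure 5 caption can be compared against the Benford reference distribution with a short sketch. It assumes the quantities being tested are positive integer counts (such as pixel counts); the helper names and the powers-of-two test data are hypothetical illustrations, not the authors' code or data.

```python
import math
from collections import Counter


def benford_expected():
    """Benford probabilities P(d) = log10(1 + 1/d) for d = 1..9."""
    return [math.log10(1.0 + 1.0 / d) for d in range(1, 10)]


def leading_digit(n):
    """Leading decimal digit of a positive integer (assumed input type)."""
    if n <= 0:
        raise ValueError("leading_digit expects a positive integer")
    return int(str(n)[0])


def leading_digit_freqs(values):
    """Observed proportion of each leading digit 1..9 among `values`."""
    tally = Counter(leading_digit(v) for v in values)
    total = sum(tally.values())
    return [tally.get(d, 0) / total for d in range(1, 10)]


# Powers of two are a classic sequence known to follow Benford's law closely.
observed = leading_digit_freqs([2 ** k for k in range(1, 1001)])
for d, (obs, exp) in enumerate(zip(observed, benford_expected()), start=1):
    print(f"digit {d}: observed={obs:.3f} expected={exp:.3f}")
```

Restricting the helper to integers avoids floating-point rounding at digit boundaries; real-valued activations would need a more careful leading-digit extraction.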