Unsupervised detection of semantic correlations in big data

Santiago Acevedo, Alex Rodriguez, Alessandro Laio

TL;DR

A method to estimate the binary intrinsic dimension of a dataset of binary type. This quantity is the minimum number of independent coordinates needed to describe the data, and is therefore a proxy for semantic complexity.

Abstract

In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy for semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test this approach by identifying phase transitions in model magnetic systems, and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.

Paper Structure

This paper contains 21 sections, 17 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: The Binary Intrinsic Dimension (BID) of spin systems as models of interacting bit arrays. Upper row: BID per spin as a function of the total number of spins, $N$, for several temperatures $T$. Central row: Thermal dependence of the BID normalized per spin. $L$ stands for the lattice width (or height). Bottom row: Model validation at $L=100$ and different temperatures. Colored crosses: the empirical probability of Hamming distances, $P_{emp}(r)$. Solid black line: our model fit, Eq. \ref{eq:p-of-d}. (a,d,g): the ferromagnetic Ising model on the square lattice, (b,e,h): the Potts model on the square lattice with $q=8$ states, (c,f,i): the antiferromagnetic Ising model on the triangular lattice. Vertical dashed lines mark the exact results in the thermodynamic limit: the critical temperature $T_c = 2/\ln(1+\sqrt{2})$ in panel d) and the transition temperature $T^*=1/\ln(1+\sqrt{q})\approx 0.745$ in panel e).
  • Figure 2: The Binary Intrinsic Dimension (BID) of image-crop representations in ResNet18. a) Example of crops for a sample of ImageNet. On top of each image, we report the size of the crop. All images are resized to $224 \times 224$ resolution after cropping, so the rightmost image is the full image. The crop is performed only along the spatial dimensions, keeping the three channels of the images. b) A set of images constructed from random patches of size $28 \times 28$ from ImageNet samples. Panels c), d), e), f): BID of image representations as a function of the number of pixels in the cropped image, $N_{crop}$, normalized by the total number of pixels in the full image, $N_{tot}=224^2$. For each panel, $l$ is the layer index and $L_R$ is the total number of layers. For further details see Methods, section 'The BID of image representations'.
  • Figure 3: The Binary Intrinsic Dimension (BID) of text representations in large language models. Upper boxes: an example of a Wikitext sentence truncated at different token lengths, $T$. Central panel: BID per bit calculated using all token representations in the sequence as a function of the number of tokens, for Pythia410m (see Methods, section 'The BID of text representations' for details). $l=0$ stands for the embedding layer, $l=24$ corresponds to the last transformer layer before the head. "Concat" stands for phrases constructed by concatenating 150 tokens from two randomly selected data samples. The dashed line corresponds to the semantic boundary between the two phrases.
  • Figure 4: Comparison between Binary Intrinsic Dimension (BID) and other real-space estimators. Panel a): Thermal dependence of the intrinsic dimension for a small $30 \times 30$ ferromagnetic Ising model. Panel b): Scaling of the intrinsic dimension normalized per bit at temperature $T=1.9$. Number of samples: 5000. GRIDE stands for the Generalized Ratios Intrinsic Dimension Estimator of Ref. denti2022generalized. FCI stands for the Full Correlation Integral estimator of Ref. erba2019intrinsic. CD stands for Correlation Integral, Ref. grassberger1983measuring. MLE stands for the Maximum Likelihood Estimator of Ref. levina2004maximum. I3D stands for the Intrinsic Dimension Estimator for Discrete Datasets of Ref. I3D.
  • Figure 5: BID Bayesian inference. Panels a) and b) show the empirical posterior densities for the two parameters of our model, $d_0$ and $d_1$, respectively. $d_0^*/N=0.0512$ and $d_1^*=1.9079$ correspond to the solution found optimizing Eq. \ref{eq:p-of-d}. The posterior means and standard deviations are $\mu_{d_0}/N = 0.0510$, $\sigma_{d_0}/N=0.0006$, $\mu_{d_1}=1.908$, $\sigma_{d_1}=0.002$. Number of samples: $N_s = 500$, number of spins $N=10^4$, temperature $T=2.3$.
  • ...and 3 more figures
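
As a rough illustration of two quantities named in the Figure 1 caption, the sketch below computes an empirical probability of Hamming distances, $P_{emp}(r)$, from a binary data matrix, and evaluates the two exact temperatures quoted there. It is not the authors' code and it does not implement the model fit: the function names (`hamming_distance_distribution`, `ising_critical_temperature`, `potts_transition_temperature`), the use of all sample pairs, and the unit convention (coupling and Boltzmann constant set to 1, as the caption's formulas suggest) are assumptions made for this example.

```python
import math

import numpy as np


def hamming_distance_distribution(bits):
    """Empirical probability of Hamming distances between all pairs of rows.

    `bits` is an (n_samples, n_bits) array of 0/1 values (hypothetical input
    layout). Returns an array p of length n_bits + 1, where p[r] is the
    fraction of sample pairs that differ in exactly r bits.
    """
    bits = np.asarray(bits, dtype=np.uint8)
    n_samples, n_bits = bits.shape
    # Index every unordered pair (i, j) with i < j.
    i, j = np.triu_indices(n_samples, k=1)
    # Hamming distance = number of positions where the two rows differ.
    distances = np.count_nonzero(bits[i] != bits[j], axis=1)
    counts = np.bincount(distances, minlength=n_bits + 1)
    return counts / counts.sum()


def ising_critical_temperature():
    """T_c = 2 / ln(1 + sqrt(2)) for the square-lattice Ising model."""
    return 2.0 / math.log(1.0 + math.sqrt(2.0))


def potts_transition_temperature(q):
    """T* = 1 / ln(1 + sqrt(q)) for the square-lattice q-state Potts model."""
    return 1.0 / math.log(1.0 + math.sqrt(q))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Uncorrelated random bits: distances should concentrate near n_bits / 2.
    samples = rng.integers(0, 2, size=(200, 64))
    p_emp = hamming_distance_distribution(samples)
    print("most likely distance:", int(np.argmax(p_emp)))
    print("T_c (Ising):", round(ising_critical_temperature(), 3))
    print("T* (Potts, q=8):", round(potts_transition_temperature(8), 3))
```

Building all pairs at once keeps the sketch short but uses memory proportional to the number of pairs times the number of bits, so for large datasets one would presumably subsample pairs or process them in chunks.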