Statistical Mechanics of Semantic Compression

Tankut Can

TL;DR

This work frames semantic compression as a combinatorial optimization in a high-dimensional Euclidean semantic space, where the meaning of a message is captured by a bag-of-words embedding and distortion is the Euclidean distance between embeddings. It develops a spin-glass Hamiltonian formulation and analyzes it with replica-symmetric mean-field theory to reveal a phase diagram featuring a first-order lossy-to-lossless transition and a crossover between extractive and abstractive compression. The study shows that, for random embeddings, near-optimal, typical-case compressions can be achieved efficiently with greedy algorithms, and that the lossy regime is well captured by RS theory while the lossless transition presents breakdowns likely due to replica-symmetry limitations. These results connect semantic representations to compression performance, offering insights into how meaning, efficiency, and paraphrasing emerge in both human communication and AI systems.

Abstract

The basic problem of semantic compression is to minimize the length of a message while preserving its meaning. This differs from classical notions of compression in that the distortion is not measured directly at the level of bits, but rather in an abstract semantic space. In order to make this precise, we take inspiration from cognitive neuroscience and machine learning and model semantic space as a continuous Euclidean vector space. In such a space, stimuli like speech, images, or even ideas, are mapped to high-dimensional real vectors, and the location of these embeddings determines their meaning relative to other embeddings. This suggests that a natural metric for semantic similarity is just the Euclidean distance, which is what we use in this work. We map the optimization problem of determining the minimal-length, meaning-preserving message to a spin glass Hamiltonian and solve the resulting statistical mechanics problem using replica theory. We map out the replica symmetric phase diagram, identifying distinct phases of semantic compression: a first-order transition occurs between lossy and lossless compression, whereas a continuous crossover is seen from extractive to abstractive compression. We conclude by showing numerical simulations of compressions obtained by simulated annealing and greedy algorithms, and argue that while the problem of finding a meaning-preserving compression is computationally hard in the worst case, there exist efficient algorithms which achieve near optimal performance in the typical case.
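As a rough illustration of the greedy approach mentioned above, the following is a minimal sketch and not the paper's implementation. It assumes a vocabulary of random Gaussian embeddings, represents a message by the sum of its word embeddings (bag of words), and measures distortion as the Euclidean distance between the summed embeddings of the original and compressed messages. The function name `greedy_compress`, its parameters, and the stopping rule are hypothetical choices made for this example.

```python
import numpy as np


def greedy_compress(embeddings, message, budget):
    """Greedily pick at most `budget` distinct vocabulary words whose summed
    embedding is close (in Euclidean distance) to that of `message`.

    Illustrative assumption: meaning = sum of word embeddings (bag of words).
    """
    # Residual = part of the message's meaning vector not yet covered.
    residual = embeddings[message].sum(axis=0)
    available = np.ones(len(embeddings), dtype=bool)
    chosen = []
    for _ in range(budget):
        # Distortion that would result from adding each candidate word.
        candidate = np.linalg.norm(residual - embeddings, axis=1)
        candidate[~available] = np.inf
        best = int(np.argmin(candidate))
        # Stop early if no remaining word lowers the distortion.
        if candidate[best] >= np.linalg.norm(residual):
            break
        chosen.append(best)
        available[best] = False
        residual = residual - embeddings[best]
    return chosen, float(np.linalg.norm(residual))


# Toy usage with random embeddings (sizes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
vocab = rng.standard_normal((200, 50)) / np.sqrt(50)
msg = rng.choice(200, size=80, replace=False)
words, distortion = greedy_compress(vocab, msg, budget=40)
print(len(words), distortion)
```

Because candidates are drawn from the whole vocabulary, this sketch may select words that do not appear in the original message; restricting `available` to the message's own words would give a purely extractive variant. Each greedy step costs one pass over the vocabulary, so the total cost grows only polynomially with vocabulary size and budget.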

Paper Structure

This paper contains 21 sections, 109 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Semantic Compression Phase Diagram and Order Parameters. A) The zero temperature phase diagram for the RS order parameters fixing $\ell = 0.4$. The discontinuous transition is indicated by the thick black line. In the lossless, compressible phase (blue region), the EA order parameter $Q > 0$, and the RS MFT has a unique solution with zero mean distortion. Within the lossy phase, where the EA order parameter $Q = 0$, there is a region (shaded orange, enclosed by black dashed curves) in which the RS MFT has multiple solutions. Outside of this region (white area), the RS MFT has a unique solution. Colored dashed lines show the slice along which the order parameters are computed in the other panels. B) Average distortion (at zero temperature) normalized by its value $\alpha \ell/2$ at $C = 0$. Outside the lossless phase, the distortion never reaches zero, but generally decreases with compression ratio. For the green curve, the value of $\alpha \approx 0.251$ is chosen such that the distortion hits zero at exactly $C = 0.6$. C) The overlap order parameters in the lossy phase. For reference, we show the Hamming compression limit (red dashed) in which $Q = 0$ and $R = \bar{\ell}$. For $\alpha = 0.5$ (solid blue), there is a bifurcation in the RS MFT at $C \approx 0.94$, above which two new solutions appear which are much closer to the Hamming limit. For larger $\alpha$, the overlap is close to the Hamming line for all compression ratios. D) Overlap and EA order parameter for $\alpha \approx 0.251$. There are no $Q = 0$ solutions to the order parameters in the region between the two vertical lines at $C = 0.6$ (solid black) and $C \approx 0.98$ (dotted black). However, $Q > 0$ solutions actually appear at compression ratios slightly smaller than $C = 0.6$. There is no reason to prefer one of these over the other in the RS theory, since they both have negative entropy. Furthermore, the $Q > 0$ solutions survive until $C = 1$. However, approaching $C = 1$ the RS MFT does yield a sensible physical solution, which we believe takes over for larger compression ratios.
  • Figure 2: Numerics. Comparing numerical optimization via simulated annealing (SA) and a greedy algorithm (GA) with RS MFT. A) The SA values (crosses) of the order parameters diverge from the RS MFT prediction (solid curves) in the regime where we expect to see the phase transition to lossless and abstractive compression (for the green points, past $C = 0.6$). There is also a striking deviation for the blue points ($\alpha = 0.5$) at intermediate $C$, but agreement for small $C$ and $C \to 1$. Curiously, when disagreement between RS MFT and SA is large, the GA finds solutions with an order parameter that is very close to that predicted by the RS MFT. B) The numerically computed EA order parameter shows a smooth rise and fall, in contrast to the theoretical prediction. Since the GA produces a unique minimizer for a given $c$, there is no comparison to be made here with SA, which generally finds multiple minimizers for a given quenched message. C) The average distortion is fairly well described by theory, with notable deviations for larger compression ratios, as in the previous plots. We used $N = 200$ and averaged over $10$ embedding (disorder) realizations. The order parameters are computed by estimating the low energy spectrum of the Hamiltonian and computing a truncated temperature average (see the appendix on numerics for details).
  • Figure 3: Phase diagram dependence on message length. (Left) The zero temperature phase diagram of the RS MFT for different values of relative message length $\ell$. In each figure, we plot only the curves demarcating the compressible phase $Q > 0$. The solid curve indicates the discontinuous transition, whereas the dashed curve corresponds to the appearance of self-consistent incompressible ($Q = 0$) solutions. For each fixed $\ell$, the compressible region extends out to some maximal $\alpha$ corresponding to the point where these two curves meet (denoted by a star in the figure). (Right) The maximal $\alpha$ depends non-monotonically on $\ell$, and tends to zero both as $\ell \to 0$ and $\ell \to 1$.