A quantitative analysis of semantic information in deep representations of text and images

Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio

TL;DR

The paper introduces Information Imbalance II, a directional, neighborhood-based metric to quantify how much semantic information is shared across deep representations of semantically related data. By applying this method to translations and image-caption pairs, it localizes semantic content to mid-to-deep layers and shows that semantic information is distributed across many tokens with long-range correlations. It also reveals cross-modal alignment between caption and image representations, with asymmetries reflecting architecture and training objectives. Across large language and vision transformers, the work supports the Platonic hypothesis that better models converge toward shared semantic representations, providing a quantitative map of where meaning resides in multimodal deep networks.

Abstract

Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

Paper Structure

This paper contains 26 sections, 2 equations, 14 figures.

Figures (14)

  • Figure 1: Left: Information Imbalance $\Delta(X\!\to\!Y)$ and $\Delta(Y\!\to\!X)$, compared with Centered Kernel Alignment (CKA), for a synthetic Gaussian construction in which each index $r$ generates a pair $(X_r, Y_r)$ via $Y_r = B_r X_r + \varepsilon$, with $X_r \sim \mathcal{N}(0, I)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, in $D = 10$ dimensions. The matrices $B_r \in \mathbb{R}^{D \times D}$ are designed to have monotonically increasing rank, from rank one at $r{=}1$ to full rank at the final index. Right: Statistical-power benchmark on a high-dimensional Gaussian model. We compute the Information Imbalance and CKA using only a fraction of the $p$ components, ranging from small subsets to the full vector, for $p=10^3$ and $p=10^5$. In both panels, we report the standard error computed over ten jackknife repetitions.
  • Figure 2: Panel a) Information Imbalance from English to Spanish, using representations generated at equal depth from translated sentences of opus books, as a function of the relative network depth, for DeepSeek-V3, BERT-multilingual and Llama3 with 1, 3, and 8 billion parameters. A smaller value of the Information Imbalance corresponds to higher predictive power. We used the concatenation of the last $20$ tokens for the computation. Panel b) Information Imbalance between equal-layer representations of DeepSeek-V3 processing different translation pairs. For each language pair in the legend, we show the Information Imbalance from the first language to the second. The jackknife error bars (five repetitions subsampling 2,000 of 5,000 sentences) are too small to show; the marker sizes act as an upper bound on these errors.
  • Figure 3: Panel a) Minimum Information Imbalance across depth, from English to Spanish representations, as a function of the number of tokens used in the computation, for DeepSeek-V3 and Llama3.1-8B, when considering shorter (40 to 80 tokens, in circles) or longer (100 to 200 tokens, in triangles) sentences. Panels b) and c) Information Imbalance from the last token to a previous token at token-distance $\tau$, using English sentences, computed on the representations generated by DeepSeek-V3 in panel b) and by Llama3.1-8B in panel c), in different layers. Long-distance token-token correlations are maximal (Information Imbalance increases most slowly) for representations on the inner semantic layers (43 for DeepSeek-V3, 23 for Llama3.1-8B), and the effect is dramatically stronger in DeepSeek-V3. The jackknife error bars (five repetitions subsampling 2,000 of 5,000 sentences) are too small to show; the marker sizes act as an upper bound on these errors.
  • Figure 4: Panel a) Information Imbalance for image pairs from the Imagenet1k dataset. We sample 2500 pairs of images from the same class at random and average over five replications, reporting the mean and standard deviation. Panel b) Information Imbalance between image and caption representations on the flickr30k dataset. Images are encoded using DinoV2 and image-gpt-large, while captions are processed with DeepSeek-V3. We report the imbalance as a function of the relative depth in the vision transformer, using the 52nd layer of DeepSeek-V3 (inside the semantic region) for captions. Results are averaged over 5000 samples, with uncertainty estimated via bootstrapping (100 replicas of 200 samples each). Dotted lines indicate information flow from caption to image; dashed lines, from image to caption.
  • Figure 5: Information Imbalance between English (en) and Spanish (es) representations generated by DeepSeek-V3 as a function of depth. 'Concat' stands for the results obtained with the concatenation of the last 20 tokens of each sentence. 'Average' stands for the average of the same 20 tokens. The standard deviation is computed with a jackknife procedure, subsampling 2000 samples out of 5000 five times, and it is smaller than the marker size.
  • ...and 9 more figures
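
To make the synthetic setup in the Figure 1 caption concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the commonly used rank-based definition of the Information Imbalance, $\Delta(A\!\to\!B) = \frac{2}{N}\,\langle r^B \mid r^A = 1 \rangle$ (the mean rank, in space $B$, of each point's nearest neighbor in space $A$), with Euclidean distances. The function names, the sample size, the noise level $\sigma = 0.1$, and the diagonal form of $B_r$ are all hypothetical choices made for illustration; the caption only states that the rank of $B_r$ increases with $r$.

```python
import numpy as np


def information_imbalance(a, b):
    """Delta(A -> B): (2/N) times the mean rank, in space B, of each
    point's nearest neighbor in space A. Values near 0 mean A predicts B
    well; values near 1 mean A is uninformative about B.
    Assumption: Euclidean distances and the standard rank-based definition."""
    n = a.shape[0]
    # Pairwise squared Euclidean distances in each space.
    da = ((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1)
    db = ((b[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Exclude self-distances so a point is never its own neighbor.
    np.fill_diagonal(da, np.inf)
    np.fill_diagonal(db, np.inf)
    nn_a = da.argmin(axis=1)  # index of the nearest neighbor in A
    # Double argsort turns distances into ranks (1 = nearest) in B.
    ranks_b = db.argsort(axis=1).argsort(axis=1) + 1
    return 2.0 / n * ranks_b[np.arange(n), nn_a].mean()


def synthetic_pair(rank, n=500, dim=10, sigma=0.1, seed=0):
    """Draw X ~ N(0, I) and Y = B X + eps with eps ~ N(0, sigma^2 I).
    Assumption: B is diagonal with `rank` ones, a hypothetical stand-in
    for the rank-controlled matrices B_r in the caption."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    b_mat = np.zeros((dim, dim))
    b_mat[np.arange(rank), np.arange(rank)] = 1.0
    y = x @ b_mat.T + sigma * rng.standard_normal((n, dim))
    return x, y


# Print both directions of the imbalance for a few ranks.
for rank in (1, 5, 10):
    x, y = synthetic_pair(rank)
    print(rank, information_imbalance(x, y), information_imbalance(y, x))
```

Under these assumptions, a low-rank $B_r$ should make $Y$ a lossy view of $X$, so one would expect $\Delta(Y\!\to\!X)$ to exceed $\Delta(X\!\to\!Y)$ at low rank and both to shrink as the rank approaches $D$. The dense pairwise-distance computation uses $O(N^2 D)$ memory, so it is only suitable for small samples.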