A quantitative analysis of semantic information in deep representations of text and images
Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio
TL;DR
The paper introduces the Information Imbalance (II), a directional, neighborhood-based metric that quantifies how much semantic information is shared across deep representations of semantically related data. Applied to translations and image-caption pairs, the method localizes semantic content to mid-to-deep layers and shows that semantic information is distributed across many tokens with long-range correlations. It also reveals cross-modal alignment between caption and image representations, with asymmetries that reflect architecture and training objectives. Across large language and vision transformers, the work supports the Platonic hypothesis that better models converge toward shared semantic representations, and it provides a quantitative map of where meaning resides in multimodal deep networks.
Abstract
Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. Moreover, we find that in these layers a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within vision transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
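
To make the idea of a directional, neighborhood-based comparison concrete, the following is a minimal sketch of the Information Imbalance named in the TL;DR, assuming its common rank-based form: for each point, find its nearest neighbor in representation A and record the rank of that same point among its neighbors in representation B, then average and rescale by 2/N. Values near 0 suggest that A predicts the neighborhoods of B, while values near 1 suggest that it does not. The function names, the Euclidean distance, and the toy data are illustrative assumptions; the estimator actually used in the paper (distance, number of neighbors, token aggregation) may differ.

```python
import numpy as np

def neighbor_ranks(X):
    """Entry [i, j] is the rank of point j among the neighbors of point i
    (1 = nearest), using Euclidean distance. Assumption: no distance ties."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is never its own neighbor
    order = np.argsort(d, axis=1)          # order[i, k] = index of k-th neighbor of i
    ranks = np.empty_like(order)
    rows = np.arange(X.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, X.shape[0] + 1)[None, :]
    return ranks

def information_imbalance(A, B):
    """Sketch of Delta(A -> B): rescaled mean rank in B of each point's
    nearest neighbor in A. A and B are (N, d_A) and (N, d_B) arrays whose
    rows describe the same N items (e.g. a sentence and its translation)."""
    n = A.shape[0]
    ranks_a, ranks_b = neighbor_ranks(A), neighbor_ranks(B)
    nn_in_a = np.argmin(ranks_a, axis=1)   # index of the rank-1 neighbor in A
    return 2.0 / n * ranks_b[np.arange(n), nn_in_a].mean()

# Toy usage (hypothetical data): B keeps only the first coordinate of A,
# so A should be more informative about B than B is about A.
rng = np.random.default_rng(0)
A = rng.uniform(size=(300, 2))
B = A[:, :1]
a_to_b = information_imbalance(A, B)
b_to_a = information_imbalance(B, A)
print(f"Delta(A->B) = {a_to_b:.3f}, Delta(B->A) = {b_to_a:.3f}")
```

The asymmetry between the two directions is what makes the measure directional: comparing Delta(A->B) with Delta(B->A) indicates which representation carries more information about the other, which is the kind of comparison described above for text versus image representations.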
