
Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding

Alessandro Achille, Greg Ver Steeg, Tian Yu Liu, Matthew Trager, Carson Klingenberg, Stefano Soatto

TL;DR

This work defines Complexity-Constrained Descriptive Autoencoding (CC:DAE) to measure conceptual similarity between data by comparing how well optimal, complexity-bounded text descriptions describe each sample. By replacing exact discrimination with a stochastic relaxation over descriptions and evaluating area under the resulting distance curve, CC:DAE yields human-aligned similarity scores and interpretable explanations of what differentiates or links two samples. The method demonstrates state-of-the-art alignment on text and image similarity benchmarks, including cross-modal tasks, without requiring fine-tuning on human scores. It emphasizes the role of language-based explanations and capacity constraints to separate structural semantic information from incidental details, while offering practical computation via encoder models and importance sampling. The framework is extensible to multiple modalities and prompts, with future work envisioned to integrate richer visual descriptors alongside textual explanations.

Abstract

Quantifying the degree of similarity between images is a key copyright issue for image-based machine learning. In legal doctrine, however, determining the degree of similarity between works requires subjective analysis, and fact-finders (judges and juries) can demonstrate considerable variability in these subjective judgement calls. Images that are structurally similar can be deemed dissimilar, whereas images of completely different scenes can be deemed similar enough to support a claim of copying. We seek to define and compute a notion of "conceptual similarity" among images that captures high-level relations even among images that do not share repeated elements or visually similar components. The idea is to use a base multi-modal model to generate "explanations" (captions) of visual data at increasing levels of complexity. Then, similarity can be measured by the length of the caption needed to discriminate between the two images: two highly dissimilar images can be discriminated early in their description, whereas conceptually similar ones will need more detail to be distinguished. We operationalize this definition and show that it correlates with subjective (averaged human evaluation) assessment, and beats existing baselines on both image-to-image and text-to-text similarity benchmarks. Beyond just providing a number, our method also offers interpretability by pointing to the specific level of granularity of the description where the source data are differentiated.
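
As a rough illustration of the idea in the abstract, the sketch below takes per-level log-likelihoods as given and computes (a) the gap between the best individual and best common description at each complexity level, (b) the first level at which the two samples become distinguishable, and (c) the area under the gap curve as a single score. The function names, the threshold, and all numbers are hypothetical and chosen only for illustration; the paper's actual estimator (stochastic relaxation over descriptions, importance sampling) is not reproduced here.

```python
from typing import Optional, Sequence


def gap_curve(ll_individual: Sequence[float], ll_common: Sequence[float]) -> list:
    """Gap, per complexity level, between the log-likelihood under the best
    individual description and under the best common description."""
    if len(ll_individual) != len(ll_common):
        raise ValueError("curves must have one value per complexity level")
    return [a - b for a, b in zip(ll_individual, ll_common)]


def first_discriminating_level(
    levels: Sequence[float], gaps: Sequence[float], threshold: float
) -> Optional[float]:
    """First complexity level whose gap exceeds the threshold, or None."""
    for level, gap in zip(levels, gaps):
        if gap > threshold:
            return level
    return None


def area_under(levels: Sequence[float], values: Sequence[float]) -> float:
    """Trapezoidal-rule area under a curve sampled at the given levels."""
    pairs = list(zip(levels, values))
    return float(
        sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]))
    )


# Hypothetical values, for illustration only (not taken from the paper).
levels = [0, 12, 24, 36]                 # complexity budget of the description
ll_individual = [-10.0, -8.0, -5.0, -3.0]  # best description of one sample alone
ll_common = [-10.0, -8.0, -6.0, -6.0]      # best description shared by both samples

gaps = gap_curve(ll_individual, ll_common)
split_level = first_discriminating_level(levels, gaps, threshold=0.5)
distance = area_under(levels, gaps)
print(gaps, split_level, distance)
```

With these made-up inputs, short descriptions fit both samples equally well and the gap only opens at the third level, so a more similar pair would show a later split and a smaller area.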

Paper Structure (41 sections, 39 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: If we describe each image at increasing levels of complexity (blue and orange text), short descriptions apply equally well to both, as measured by their likelihood. However, as the complexity level of the description increases, a gap emerges between the likelihood under the best common description (grey) and the likelihood under the best individual descriptions (blue and orange). For instance, at $C=36$ the best individual descriptions are "Red Fiat 500 car" and "A Ferrari", whereas the best common description is "Italian car brand", which is not as descriptive. The gap traces two asymmetric curves that measure the conceptual difference between the images at each level of complexity. A single number can be obtained by measuring the area between the curves.
  • Figure 2: These two images have similar art styles, theme, and subject matter. On the other hand, it is difficult to identify specific visual elements that appear in both images. These two images were found to be "substantially similar" in Steinberg v. Columbia Pictures, based on the arrangement of similar features in a similar way. How can we measure how similar these images are?
  • Figure 3: Role of prompts. Consider the three images above: which pair is most similar? This depends on whether we focus on content (the first two depict Notre-Dame, the third a boat on the Seine) or on style and artistic technique (the first is a photograph, the second and third are paintings in the pointillist style). By changing the prompt ("Describe the style of the image" or "Describe the content of the image"), the user can bias $p(h|x)$ to focus the conceptual distance on one or the other aspect. Note that images A and B are closer under the content prompt, but B and C are closer under the style prompt.
  • Figure 4: Conceptual similarity is not the same as shared information. Are these two pictures similar? Not according to Normalized Compression Distance (NCD), which measures their difference at 97.1% (estimated using JPEG XL lossless compression). However, they share all the structural information: they are the exact same ink print on a piece of paper. The only difference is the randomness of the paper texture. Most people would not consider it a significant conceptual difference, but since NCD cannot differentiate structure from randomness, this slight change accounts for 97.1% of the difference. This problem is inherent in high-dimensional data, where information in random variation overshadows structural information. In fact, on the right we plot the NCD as we change the resolution of the image, showing that the distance increases drastically as we increase the dimension of the data.
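
The effect described in Figure 4 can be reproduced in miniature. The sketch below uses the standard NCD formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with `zlib` standing in for the JPEG XL lossless compressor mentioned in the caption (an assumption made so the example runs with the standard library alone). The byte strings are synthetic stand-ins for the two prints: identical structure plus independent random "texture".

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (proxy for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible

# Shared "structure": a highly regular pattern, identical in both samples.
structure = b"the same ink print " * 50

# Independent "texture": incompressible random bytes, different per sample.
noise_a = bytes(rng.getrandbits(8) for _ in range(20000))
noise_b = bytes(rng.getrandbits(8) for _ in range(20000))

ncd_structure_only = ncd(structure, structure)
ncd_with_noise = ncd(structure + noise_a, structure + noise_b)
print(ncd_structure_only, ncd_with_noise)
```

Without noise the two samples compress almost as well together as alone, so the distance is small; once the incompressible texture dominates the byte count, the distance approaches 1 even though the structure is unchanged.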
