Relational Visual Similarity

Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li

TL;DR

The paper introduces relational visual similarity, a measure that captures the abstract relational logic shared between images beyond their surface attributes. It builds a relational dataset by grouping images by underlying relational patterns and generating anonymous captions describing those patterns, then trains a vision-language model to align image embeddings with these captions using a contrastive objective. The resulting model, relsim, outperforms traditional attribute-based similarity metrics and caption-only baselines in relational image retrieval, with human studies confirming closer alignment to human judgments. The work demonstrates practical applications in relational retrieval and analogical image generation, while acknowledging limitations in dataset scale and potential biases. Overall, it reveals a crucial, previously underexplored dimension of visual similarity and lays groundwork for future relational understanding in vision systems.

Abstract

Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.

Paper Structure

This paper contains 12 sections, 5 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Would you say images in Group A are similar to the Reference Image? Current state-of-the-art image similarity models (e.g., LPIPS, CLIP) would answer no. These models would say only Group B is similar to the reference image, as they equate similarity with a high degree of shared perceptual attribute features (i.e., color, shape, semantic class). However, as humans, we would confidently say yes: images in both groups are similar to the reference. While Group B is similar in perceptual attributes, Group A is similar in a more abstract, relational sense (e.g., "transformation of {subject} through time", first row). In this paper, we propose to model this missing dimension of visual similarity, called relational visual similarity, capturing human-like reasoning over relational structures.
  • Figure 2: Overall pipeline. (a) We train an image filtering model to select high-quality relational images from LAION-2B. (b) An anonymous captioning model is trained on groups of images that share the same underlying logic, pairing all images in each group with the same anonymous caption. (c) Training the relational visual similarity (relsim) model involves a contrastive loss between image features and their corresponding anonymous captions.
  • Figure 3: Examples of relationally interesting vs. ordinary images.
  • Figure 4: Writing an anonymous caption is hard from a single image, but easier with an image group where the pattern is clear.
  • Figure 5: Attribute vs. Relational Visual Image Retrieval. Visualization of nearest neighbors retrieved using different visual similarity metrics. As can be seen, only ours captures the relational similarity.
  • ...and 10 more figures
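
The contrastive step in Figure 2(c) can be illustrated with a minimal sketch. The page does not state the exact loss, so this assumes a CLIP-style symmetric InfoNCE objective over cosine similarities. The function names, the temperature value of 0.07, and the toy 2-D embeddings are all hypothetical. It also assumes every item in a batch comes from a different image group, so that the only positive for image i is caption i.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(image_embs, caption_embs, temperature):
    """Pairwise cosine similarities (images x captions), divided by temperature."""
    images = [l2_normalize(v) for v in image_embs]
    captions = [l2_normalize(v) for v in caption_embs]
    return [
        [sum(a * b for a, b in zip(img, cap)) / temperature for cap in captions]
        for img in images
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, caption_embs, temperature=0.07):
    """Average of image-to-caption and caption-to-image cross-entropy (assumed form)."""
    logits = cosine_logits(image_embs, caption_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy 2-D embeddings: image i should match anonymous caption i.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(images, matched_captions)
loss_swapped = symmetric_contrastive_loss(images, swapped_captions)
print(loss_matched < loss_swapped)
```

Minimizing such a loss pulls each image embedding toward the embedding of its anonymous caption and pushes it away from the other captions in the batch. Because the caption describes relational logic rather than surface content, images that share a caption are drawn together regardless of appearance.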