Conjuring Semantic Similarity

Tian Yu Liu, Stefano Soatto

TL;DR

The semantic similarity between two textual expressions is characterized as the distance between the image distributions they induce, or 'conjure.' Choosing the Jensen-Shannon divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression makes this distance directly computable via Monte-Carlo sampling.

Abstract

The semantic similarity between sample expressions measures the distance between their latent 'meaning'. Such meanings are themselves typically represented by textual expressions, often insufficient to differentiate concepts at fine granularity. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or 'conjure.' We show that by choosing the Jensen-Shannon divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
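
The Monte-Carlo computation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoiser`, the vector-valued "prompts", and the deterministic update step are all hypothetical stand-ins for a real text-conditioned diffusion model and its sampler. The sketch only shows the overall shape of the estimate: generate noisy trajectories with either prompt, denoise each noisy image with both prompts, and average the squared Euclidean distance between the two outputs.

```python
import numpy as np

def toy_denoiser(x_t, t, prompt_vec):
    """Hypothetical stand-in for a text-conditioned denoiser.

    It pulls the noisy sample toward a vector representing the prompt;
    a real diffusion model would replace this function.
    """
    return x_t + (prompt_vec - x_t) / (t + 1)

def conjured_distance(denoise, prompt_a, prompt_b, dim,
                      num_steps=10, num_samples=5, seed=0):
    """Monte-Carlo estimate of a distance between two prompts.

    For each sample, a trajectory is generated with either prompt, and
    every noisy image along it is denoised with both prompts.
    """
    rng = np.random.default_rng(seed)
    total, count = 0.0, 0
    for _ in range(num_samples):
        for guide in (prompt_a, prompt_b):
            x = rng.standard_normal(dim)  # Gaussian prior at t = T
            for t in range(num_steps, 0, -1):
                out_a = denoise(x, t, prompt_a)
                out_b = denoise(x, t, prompt_b)
                total += float(np.sum((out_a - out_b) ** 2))
                count += 1
                # Assumption: deterministic update along the guiding prompt;
                # a real sampler would follow its own noise schedule.
                x = denoise(x, t, guide)
    return total / count

# Toy "prompts" represented directly as vectors (an assumption for illustration).
cat = np.zeros(4)
kitten = np.ones(4)
truck = 3.0 * np.ones(4)
d_near = conjured_distance(toy_denoiser, cat, kitten, dim=4)
d_far = conjured_distance(toy_denoiser, cat, truck, dim=4)
```

In this toy setting the estimate is zero for identical prompts and grows as the prompt vectors move apart; with a real model, the same loop would compare the outputs of the denoising network under the two text conditions.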

Paper Structure

This paper contains 19 sections, 8 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: We illustrate the process of conjuring semantic similarity between the textual expressions "Snow Leopard" and "Bengal Tiger". We denoise each sequence of noisy images (middle row of both halves of the figure) with both prompts (top and bottom rows of both halves of the figure). Our method can be interpreted as taking the Euclidean distance between the resulting images in the two rows. The sequences of noisy images are obtained with either of the two text expressions (top / bottom halves of the figure), starting from a Gaussian prior ($t=T$). In the cells highlighted in red, the model converts pictures of Snow Leopards into Bengal Tigers by changing their characteristic spotted coats into stripes and adding striped textures to the animal's face (top half of the figure), and conversely converts Bengal Tigers into Snow Leopards by changing their characteristic stripes into spotted coats (bottom half of the figure). This makes their semantic differences interpretable via changes in the imagery they evoke.
  • Figure 2: Qualitative evaluation of conjured semantic similarity. (Left) Nouns cluster based on shared hypernym classes: dogs (puppy, poodle, dalmatian, pug) form a visible cluster in the top-left 4x4 block, while marine animals (whale, shark, dolphin, sealion) form another cluster in the bottom-right 4x4 block. (Right) The same pattern holds for flying-related action verbs (elevate, ascend, soar, glide) vs. negative stative verbs (disappoint, grieve, worry, regret).
  • Figure 3: (Left) We ablate over different choices of priors over timesteps: a uniform distribution over timesteps $\{T', \ldots, T\}$ where $T' \leq T=10$, represented by the blue line (cumulative), and the Dirac delta on any particular timestep $T' \in \{1, \ldots, T\}$, represented by the orange line (pointwise). We show that a uniform prior over all timesteps gives the best results. The same plot also ablates over the number of Monte-Carlo samples, $k \in \{1, \ldots, 5\}$, from which we conclude that only a few iterations are required for convergence. (Right) We further ablate over different choices of diffusion models, and show that results remain relatively consistent across the tested choices.
  • Figure 4: "Merlion" vs "Mermaid Lion": While both prompts express compositions of the same set of objects, the model associates different meanings with "Merlion" as opposed to "Mermaid + Lion": the former is associated with the mascot of Singapore, while the latter is a mermaid with hair resembling a lion's mane.
  • Figure 5: "Bag of Chips" vs "Bag of Fries": The interpretation of "chips" depends on cultural background (US vs UK), but the interpretation of "fries" is relatively unambiguous. Interestingly, this observation can be visualized when computing semantic similarity with our method. On the left of the figure (second image column), the model attempts to convert a picture of chips (US) into fries by changing the rounded textures into sharper rectangular ones when denoised with "Bag of Fries". On the other hand, pictures of fries remain relatively identifiable as fries (fifth image column) when denoised using "Bag of Chips".
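
The two timestep priors compared in Figure 3 (a uniform distribution over $\{T', \ldots, T\}$ and a Dirac delta on a single timestep) can be sketched as weight vectors applied to per-timestep distances. The function names and the `per_step` array below are hypothetical and serve only to illustrate the weighting; they are not taken from the paper.

```python
import numpy as np

def uniform_prior(t_start, T=10):
    """Uniform weights over timesteps {t_start, ..., T}; zero elsewhere."""
    if not 1 <= t_start <= T:
        raise ValueError("t_start must lie in 1..T")
    weights = np.zeros(T)
    weights[t_start - 1:] = 1.0 / (T - t_start + 1)
    return weights

def dirac_prior(t_only, T=10):
    """All weight on the single timestep t_only."""
    if not 1 <= t_only <= T:
        raise ValueError("t_only must lie in 1..T")
    weights = np.zeros(T)
    weights[t_only - 1] = 1.0
    return weights

def weighted_distance(per_step, prior):
    """Expected per-timestep distance under the given timestep prior."""
    return float(np.dot(per_step, prior))

# Hypothetical per-timestep distances for timesteps 1..10 (illustration only).
per_step = np.arange(1, 11, dtype=float)
cumulative = weighted_distance(per_step, uniform_prior(1))
pointwise = weighted_distance(per_step, dirac_prior(3))
```

Under this framing, the "cumulative" curve corresponds to sweeping `t_start` in `uniform_prior`, and the "pointwise" curve to sweeping `t_only` in `dirac_prior`; the two coincide when `t_start` equals `T`.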