Vibe Spaces for Creatively Connecting and Expressing Visual Concepts

Huzheng Yang, Katherine Xu, Andrew Lu, Michael D. Grossberg, Yutong Bai, Jianbo Shi

TL;DR

This paper tackles creative visual blending by identifying and merging the most relevant shared attributes—'vibes'—between images. It introduces Vibe Space, a hierarchical graph manifold learned on a multiscale diffusion framework to produce non-linear geodesics in ambient feature spaces like CLIP, enabling coherent Vibe Blending and Vibe Analogy. The authors develop a cognitively inspired evaluation framework combining human judgments, LLM reasoning, and a path nonlinearity score (PNS) to measure blend creativity and difficulty, demonstrating superior creativity and coherence over strong baselines on challenging pairs. They also propose mechanisms for creative control, extrapolation, and negative vibe suppression, offering practical tools for controllable, image-conditioned creative synthesis with efficient training and inference.

Abstract

Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes: their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveal these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.

Paper Structure

This paper contains 63 sections, 36 equations, 34 figures, 6 tables, 5 algorithms.

Figures (34)

  • Figure 1: What are the most relevant attributes for blending the violin player and guitar player? It is the instrument and how it is played, not the color or background. We call these attributes the "vibe". Recent diffusion-based morphing methods such as Yu et al. [yu2025probability] and LLMs like GPT [gpt-image-1] and Gemini [comanici2025gemini] struggle to blend the vibe, instead interpolating pixels, transferring style, or composing parts. We propose Vibe Space for identifying the vibe between input images and generating coherent, continuous blends that merge the vibe (Vibe Blending). With the discovered vibe, we can extrapolate to nontrivial but related concepts, such as Hilary Hahn playing guitar (Vibe Analogy).
  • Figure 2: Our method generates coherent blends that focus on the most relevant attribute shared by the input images: the hairstyle. Diffusion-based morphing methods like DiffMorpher [zhang2024diffmorpher] and Yu et al. [yu2025probability] struggle to produce realistic blends of distant concepts, and AID [he2024aid] fails to capture hairstyle as the relevant attribute.
  • Figure 3: Top: Using a 2D point cloud example, the forward mapping involves computing the (a.) affinity graph $\mathbf{W}$ of the points $\mathbf{x}$ on the manifold and (b.) generalized eigenvectors $\mathbf{\Psi}(\mathbf{x})$ of the graph Laplacian $\mathbf{L}$ as manifold coordinates of the point cloud $\mathbf{x}$. (c.) The inverse mapping performs linear interpolation in the manifold space and uses graph diffusion inversion to obtain the corresponding path in the original point cloud space. Bottom: (d.) On real images, we extract patch tokens from DINO features as graph nodes and compute token-wise affinity $\mathbf{W}$. (e.) The top $m$ graph eigenvectors produce co-salient segments across two images for blending, and manifold coordinates for expressing "vibe" features for blending. (f.) Similar to the point cloud example, we apply graph diffusion inversion to obtain a path in CLIP space. We "render" pixel images from CLIP features using a frozen IP-Adapter [ye2023ip]. We train two lightweight MLP networks in under 30 seconds: an encoder to simulate and compress the forward mapping, and a decoder to mimic the inverse mapping.
  • Figure 4: From Vibe Blending to Vibe Analogy. Vibe Space enables creative connections between input images $A$ and $B$. A path approximately linear in Vibe Space results in a continuous manifold-following path in the ambient feature space, such as CLIP. We can lift the "vibe" $\mathbf{\Delta}_{A\to B}$ to a non-trivial but related image $A'$ to extrapolate an analogous path in the ambient space, resulting in image $B'$ that reflects the same vibe. For example, we can morph Leonardo DiCaprio's face into a playing card.
  • Figure 5: Negative vibe control. Vibe attributes are implicitly extracted by Vibe Space. The blending pair defines desired vibes (rotation + style). The negative pair defines vibes to suppress (style). Blending without negative examples transfers both attributes. Subtracting the negative vibe, only rotation is blended.
  • ...and 29 more figures
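
The 2D point cloud pipeline in the Figure 3 caption (affinity graph, generalized eigenvectors of the graph Laplacian, linear interpolation in manifold coordinates) and the vibe lifting in the Figure 4 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the Gaussian affinity, the bandwidth `sigma`, the choice `m=1`, and all function names are hypothetical, and the inverse mapping here is a plain k-nearest-neighbor average standing in for the paper's graph diffusion inversion, whose details are not given in this excerpt.

```python
import numpy as np


def spectral_coordinates(x, sigma=0.3, m=1):
    """Manifold coordinates: the first m nontrivial solutions of L psi = lambda D psi."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))   # assumed Gaussian affinity graph
    d = W.sum(axis=1)                            # node degrees (diagonal of D)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Solve the generalized problem through the symmetric normalized Laplacian.
    L_sym = np.eye(len(x)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    psi = d_inv_sqrt[:, None] * eigvecs          # map back to generalized eigenvectors
    return psi[:, 1:m + 1]                       # drop the constant eigenvector


def decode(x, psi, targets, k=5):
    """Stand-in inverse mapping (NOT graph diffusion inversion): average the
    k points whose manifold coordinates are closest to each target."""
    dists = np.linalg.norm(targets[:, None, :] - psi[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return x[nearest].mean(axis=1)


def blend_path(x, psi, i_a, i_b, steps=9):
    """Interpolate linearly in manifold coordinates, then decode to the original space."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    targets = (1.0 - t) * psi[i_a] + t * psi[i_b]
    return decode(x, psi, targets)


def analogy(x, psi, i_a, i_b, i_a_prime):
    """Lift the offset from A to B onto A' in manifold coordinates and decode B'."""
    delta = psi[i_b] - psi[i_a]
    return decode(x, psi, (psi[i_a_prime] + delta)[None, :])[0]


# Toy manifold: points on a unit half circle, with A and B at opposite ends.
theta = np.linspace(0.0, np.pi, 200)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)
psi = spectral_coordinates(x)
path = blend_path(x, psi, 0, len(x) - 1)
b_prime = analogy(x, psi, 0, 50, 100)
```

On this half circle, the straight line between A and B passes through the origin, far from any data point, whereas the decoded path stays close to the arc. A single coordinate (`m=1`) is enough here because the toy manifold is a curve; real feature spaces would need more eigenvectors and a learned decoder, as the caption describes.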