Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration

Kfir Goldberg, Elad Richardson, Yael Vinker

TL;DR

Inspiration Seeds addresses the early, visual phase of ideation by producing non-literal, visually coherent combinations from two input images without textual prompts. It leverages CLIP Sparse Autoencoders to obtain two dominant visual aspects and learns an inverse mapping via a fine-tuned diffusion-editing backbone to generate multiple $I_{comb}$ variants. The approach automatically builds training data through implicit decomposition, enabling open-ended visual recombination and a description-complexity evaluation that correlates with perceived non-triviality. The results demonstrate richer, more imaginative visual connections than baselines, with potential to accelerate creative exploration in design workflows.

Abstract

While generative models have become powerful tools for image synthesis, they are typically optimized for executing carefully crafted textual prompts, offering limited support for the open-ended visual exploration that often precedes idea formation. In contrast, designers frequently draw inspiration from loosely connected visual references, seeking emergent connections that spark new ideas. We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. Given two input images, our model produces diverse, visually coherent compositions that reveal latent relationships between inputs, without relying on user-specified text prompts. Our approach is feed-forward, trained on synthetic triplets of decomposed visual aspects derived entirely through visual means: we use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs. By removing the reliance on language and enabling fast, intuitive recombination, our method supports visual ideation at the early and ambiguous stages of creative work.

Paper Structure (29 sections, 3 equations, 28 figures, 2 tables)

Figures (28)

  • Figure 1: Dresses from Iris van Herpen's Sensory Seas collection (2020), inspired by a resemblance between deep-sea hydrozoans and neural structures. Surfacing such unique connections is key to producing original designs.
  • Figure 2: Trivial vs. non-trivial visual combinations. Given a leaf and a portrait, Nano Banana produces a trivial combination by replacing the earring with a leaf. Our method surfaces deeper connections: the leaf's decay pattern appears in the skin, and its aged quality carries over to the subject.
  • Figure 3: Overview of our image decomposition pipeline. Given an image $I_{comb}$, we encode it via CLIP and pass the embedding through an SAE encoder $W_{enc}$. We retain the top-k activations as one-hot vectors, and decode them back to CLIP space via $W_{dec}$. We then cluster the resulting vectors into two groups using k-means. The editing direction $v_{A\to B}$ is computed as the difference between cluster centroids. Moving $e_{comb}$ in opposite directions along this axis and decoding via Kandinsky yields two images $I_A$ and $I_B$ that emphasize distinct visual aspects of the original image.
  • Figure 4: Examples of decomposed triplets. Each row shows a triplet $(I_A, I_B, I_{comb})$: two variations, $I_A$ and $I_B$, derived from a source image $I_{comb}$ using our CLIP SAE decomposition. The decomposition separates distinct visual aspects.
  • Figure 5: Visual combinations under different seeds. For the same pair of input images, our model can produce different visual combinations simply by varying the seed, without any explicit guidance.
  • ...and 23 more figures
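
The decomposition steps described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ReLU activation, the step size `alpha`, the small hand-written 2-means routine, and the random stand-ins for the CLIP embedding $e_{comb}$ and the SAE weights $W_{enc}$ and $W_{dec}$ are all hypothetical. Decoding the shifted embeddings into the images $I_A$ and $I_B$ (done via Kandinsky according to the caption) is omitted.

```python
import numpy as np

def top_k_decoded(e, W_enc, W_dec, k):
    """Decode the top-k SAE activations of e back to CLIP space, one at a time."""
    acts = np.maximum(W_enc @ e, 0.0)   # SAE activations (ReLU is an assumption)
    idx = np.argsort(acts)[-k:]         # indices of the k strongest features
    # Decoding a one-hot vector through W_dec simply selects one decoder row.
    return W_dec[idx]                   # shape (k, d)

def two_means(X, iters=20):
    """Tiny 2-means, initialised with the two mutually farthest points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    centroids = np.stack([X[i], X[j]])
    for _ in range(iters):
        dist = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(2):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(0)
    return centroids

def decompose(e_comb, W_enc, W_dec, k=8, alpha=1.0):
    """Return embeddings shifted in opposite directions along v_{A->B}."""
    vecs = top_k_decoded(e_comb, W_enc, W_dec, k)
    centroids = two_means(vecs)
    v = centroids[1] - centroids[0]     # editing direction between the clusters
    e_A = e_comb - alpha * v            # emphasises the first visual aspect
    e_B = e_comb + alpha * v            # emphasises the second visual aspect
    return e_A, e_B, v

# Usage with random stand-ins (hypothetical sizes: 16-dim embedding, 64 features).
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(64, 16))
W_dec = rng.normal(size=(64, 16))
e_comb = rng.normal(size=16)
e_A, e_B, v = decompose(e_comb, W_enc, W_dec, k=8, alpha=0.5)
```

In this sketch the two shifted embeddings are symmetric about $e_{comb}$, which mirrors the caption's description of moving in opposite directions along a single axis; how far to move along that axis is not specified in the caption, so `alpha` is left as a free parameter here.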