Morphological Addressing of Identity Basins in Text-to-Image Diffusion Models

Andrew Fraser

TL;DR

Morphological structure -- whether in feature descriptors or prompt-level phonological form -- creates systematic navigational gradients through diffusion model latent spaces.

Abstract

We demonstrate that morphological pressure creates navigable gradients at multiple levels of the text-to-image generative pipeline. In Study 1, identity basins in Stable Diffusion 1.5 can be navigated using morphological descriptors -- constituent features like "platinum blonde," "beauty mark," and "1950s glamour" -- without the target's name or photographs. A self-distillation loop (generating synthetic images from descriptor prompts, then training a LoRA on those outputs) achieves consistent convergence toward a specific identity as measured by ArcFace similarity. The trained LoRA creates a local coordinate system shaping not only the target identity but also its inverse: maximal away-conditioning produces "eldritch" structural breakdown in base SD1.5, while the LoRA-equipped model produces "uncanny valley" outputs -- coherent but precisely wrong. In Study 2, we extend this to prompt-level morphology. Drawing on phonestheme theory, we generate 200 novel nonsense words from English sound-symbolic clusters (e.g., cr-, sn-, -oid, -ax) and find that phonestheme-bearing candidates produce significantly more visually coherent outputs than random controls (mean Purity@1 = 0.371 vs. 0.209, p < 0.00001, Cohen's d = 0.55). Three candidates -- snudgeoid, crashax, and broomix -- achieve perfect visual consistency (Purity@1 = 1.0) with zero training data contamination, each generating a distinct, coherent visual identity from phonesthetic structure alone. Together, these studies establish that morphological structure -- whether in feature descriptors or prompt-level phonological form -- creates systematic navigational gradients through diffusion model latent spaces. We document phase transitions in identity basins, CFG-invariant identity stability, and novel visual concepts emerging from sub-lexical sound patterns.
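
The effect size quoted in the abstract can be unpacked with a short calculation. The sketch below assumes the conventional pooled-standard-deviation form of Cohen's d; the abstract does not say which variant was used, and the function names `cohens_d` and `implied_pooled_sd` are illustrative rather than taken from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def implied_pooled_sd(mean_a, mean_b, d):
    """Invert d = (mean_a - mean_b) / s_pooled to recover s_pooled."""
    return (mean_a - mean_b) / d


# Toy per-word scores (not data from the paper), just to exercise the function.
toy_d = cohens_d([2, 4, 6], [1, 3, 5])

# Pooled SD implied by the reported means and effect size, under the assumption above.
implied_sd = implied_pooled_sd(0.371, 0.209, 0.55)
```

Under that assumption, the reported means (0.371 and 0.209) together with d = 0.55 imply a pooled standard deviation of roughly 0.29 in Purity@1, so the two groups differ by a little over half a standard deviation.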

Paper Structure

This paper contains 48 sections, 11 figures, and 10 tables.

Figures (11)

  • Figure 1: Morphological addressing via descriptor intersection. Each natural-language descriptor (e.g., "platinum blonde," "beauty mark," "1950s glamour") defines a region in latent space. Their intersection addresses a specific identity basin without requiring the target's name or reference photographs.
  • Figure 2: Grok name-based generation. Direct prompting with "Marilyn Monroe" produces outputs closely resembling archival photographs.
  • Figure 3: Morphological descriptors ("platinum blonde curled hair, beauty mark, 1950s glamour, white halter dress") navigate to the same identity basin but generate synthetic outputs within that aesthetic space rather than reproducing training images.
  • Figure 4: The self-distillation training loop. Starting from morphological descriptors alone, the model iteratively generates images, curates outputs, trains a LoRA adapter on its own successful outputs, and refines prompts. No target name or reference photographs are used at any stage. Hit rate improved from 8% to 70% across four rounds.
  • Figure 5: Training progression across four rounds of self-distillation. Round 1 (top) shows high variance with only 8.1% of outputs approximating the target. By Round 4 (bottom), outputs exhibit binary behavior: landing clearly in the target basin or ejecting entirely.
  • ...and 6 more figures
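
The control flow of the self-distillation loop described in the Figure 4 caption can be sketched as follows. This is only a skeleton under stated assumptions: the four callables (`generate`, `curate`, `train_adapter`, `refine_prompt`) are hypothetical placeholders for the image-generation, curation, LoRA-training, and prompt-refinement steps, and the per-round hit rate is assumed here to be the fraction of generated images that survive curation, which the caption does not define.

```python
from typing import Any, Callable, List, Tuple


def self_distill(
    generate: Callable[[str, Any], List[Any]],       # hypothetical: prompt + adapter -> images
    curate: Callable[[List[Any]], List[Any]],        # hypothetical: keep on-target images
    train_adapter: Callable[[List[Any], Any], Any],  # hypothetical: fit adapter on kept images
    refine_prompt: Callable[[str, List[Any]], str],  # hypothetical: update descriptor prompt
    prompt: str,
    rounds: int = 4,
) -> Tuple[Any, str, List[float]]:
    """Run generate -> curate -> train -> refine for a fixed number of rounds."""
    adapter = None  # round 1 starts from the base model with no adapter
    hit_rates: List[float] = []
    for _ in range(rounds):
        images = generate(prompt, adapter)
        kept = curate(images)
        # Assumed definition: hit rate = kept / generated for this round.
        hit_rates.append(len(kept) / len(images) if images else 0.0)
        # The adapter is trained only on the model's own curated outputs.
        adapter = train_adapter(kept, adapter)
        prompt = refine_prompt(prompt, kept)
    return adapter, prompt, hit_rates
```

Passing the steps in as callables keeps the sketch independent of any particular diffusion or training library; real implementations of each step would be substituted for the placeholders.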