Geometry of Knowledge Allows Extending Diversity Boundaries of Large Language Models

Mateusz Bystroński, Doheon Han, Nitesh V. Chawla, Tomasz Kajdanowicz

TL;DR

The paper tackles limited generative diversity in large language models with a plug-in approach that needs no fine-tuning: continuous semantic conditioning along a structured embedding manifold. It constructs a latent variable $z$ by interpolating between semantic anchors, the embeddings of a small set of diverse anchor generations, and then maps $z$ into the LLM's embedding space via a multimodal projector (xRAG-style) to condition generation. This latent conditioning expands the semantic variance of outputs without sacrificing quality, as demonstrated on NoveltyBench and the AUT divergent-thinking task; the analyses show robust gains, with trade-offs that depend on anchor choice and interpolation strength. The approach reframes diversity as geometric exploration in semantic space, which enables metaheuristic search and offers a scalable path to enhanced creativity in language models without parameter updates to the base model.
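
The anchor-interpolation step can be illustrated with a minimal sketch. It assumes that $z$ is obtained by mixing anchor embeddings with random convex weights and scaling the offset from the anchor centroid by an interpolation strength; the function name `sample_latent`, the Dirichlet weights, and this particular parameterization of the strength are assumptions made for illustration, not the paper's definitions.

```python
import numpy as np

def sample_latent(anchors, lam, rng):
    """Draw one latent z around the anchor centroid.

    anchors: (n, d) array of anchor embeddings.
    lam: interpolation strength (hypothetical parameterization):
         0 <= lam <= 1 keeps z inside the convex hull of the anchors,
         lam > 1 can push z outward beyond it.
    """
    anchors = np.asarray(anchors, dtype=float)
    # Random convex weights: non-negative and summing to 1.
    weights = rng.dirichlet(np.ones(len(anchors)))
    mix = weights @ anchors
    centroid = anchors.mean(axis=0)
    # Scale the offset from the centroid by lam.
    return centroid + lam * (mix - centroid)
```

Under this parameterization, a strength of 0 always returns the centroid, and the spread of the sampled latents grows linearly with the strength.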

Abstract

Starting from the hypothesis that knowledge in semantic space is organized along structured manifolds, we argue that this geometric structure renders the space explorable. By traversing it and using the resulting continuous representations to condition an LLM's generation distribution, we can systematically expand the model's reachable semantic range. We introduce a framework that requires no modification of LLM parameters and operationalizes this idea by constructing a conditioning distribution from a small set of diverse anchor generations. This distribution conditions the LLM's generation via an xRAG-style projector. Our experiments demonstrate that this manifold-based conditioning substantially increases generative diversity, with direct benefits for enhancing divergent thinking, a core facet of creativity, in language models.
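
As a rough illustration of projector-based conditioning, the sketch below maps a latent vector to a single soft token and prepends it to the prompt's token embeddings. It assumes a two-layer MLP projector with random, untrained weights; the names `init_projector`, `project`, and `condition_prompt`, the layer sizes, and the choice to prepend the token are hypothetical and are not taken from the paper or from the xRAG code.

```python
import numpy as np

def init_projector(d_enc, d_hidden, d_llm, rng):
    """Random, untrained weights for a two-layer MLP (illustrative only)."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(d_enc, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 0.02, size=(d_hidden, d_llm)),
        "b2": np.zeros(d_llm),
    }

def project(z, params):
    """Map a latent z of shape (d_enc,) to one soft token of shape (d_llm,)."""
    hidden = np.maximum(z @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]

def condition_prompt(prompt_embeds, z, params):
    """Prepend the projected latent to the prompt's token embeddings."""
    soft_token = project(z, params)
    return np.vstack([soft_token[None, :], prompt_embeds])
```

In this sketch the base model's own weights are never touched: only the input embedding sequence changes, which is consistent with the abstract's statement that no LLM parameters are modified.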

Paper Structure

This paper contains 24 sections, 1 theorem, 40 equations, 4 figures, 3 tables.

Key Result

Proposition 1

Assume the “ground-truth” semantic assignment would require [...]. Let [...] be the low-density valley set separating decoder clusters, for some $\tau < \epsilon$. If a continuous $f$ satisfies [...], then necessarily [...]. Thus any continuous splitting of the single latent component into multiple decoder semantic islands must traverse the valley between them.

Figures (4)

  • Figure 1: Given an input prompt, a base LLM first generates a small set of candidate outputs. These outputs are encoded into continuous semantic embeddings, forming a local semantic manifold. New vectors are sampled from this manifold and mapped via an xRAG projector (Cheng et al., 2024) into the LLM’s embedding space. The LLM then generates new outputs conditioned on these sampled embeddings.
  • Figure 2: Cumulative originality curves for our latent-space exploration method. As more latent samples are drawn, the Top-1, Top-2, and Top-3 originality scores steadily increase.
  • Figure 3: Ablation over $\lambda$. Small values keep the latent variable inside the anchor cluster, yielding low diversity; larger values explore broader semantic regions, improving diversity without harming quality.
  • Figure 4: Graphical comparison of prompt-based methods with our approach.
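
The cumulative curves described for Figure 2 can be sketched as a running Top-k statistic. The sketch assumes that the Top-k originality after n samples is the mean of the k highest scores among the first n samples; that definition, the function name `cumulative_top_k`, and the example scores are assumptions for illustration and are not taken from the paper.

```python
import heapq

def cumulative_top_k(scores, k):
    """Mean of the k best scores seen after each new sample.

    Returns one value per sample; before k samples exist, the mean is
    taken over however many have been seen so far.
    """
    best = []   # min-heap holding the k largest scores so far
    curve = []
    for score in scores:
        if len(best) < k:
            heapq.heappush(best, score)
        elif score > best[0]:
            # The new score beats the weakest of the current top k.
            heapq.heapreplace(best, score)
        curve.append(sum(best) / len(best))
    return curve

# Made-up originality scores, one per latent sample.
example_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
top1 = cumulative_top_k(example_scores, 1)
top2 = cumulative_top_k(example_scores, 2)
```

With this definition, each curve is non-decreasing once at least k samples have been drawn, because a new sample can only replace the weakest of the current top k with a higher score.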

Theorems & Definitions (2)

  • Proposition 1: VAE Splitting Implies Semantic Valley Traversal
  • Proof: Sketch