Evaluating Embedding Generalization: How LLMs, LoRA, and SLERP Shape Representational Geometry

Siyaxolisa Kabane

TL;DR

This study benchmarks how embedding backbones (LLMs vs non-LLMs) and adaptation/merging strategies shape the geometry of embeddings on a controlled numerical-sequence task. It systematically compares LoRA adapters, model soups, and SLERP merging, evaluating representational quality with clustering metrics and visualizations. The findings show that LLM-based embeddings offer stronger generalization to non-linguistic structure, but adapter-induced specialization can erode geometry, with SLERP merging providing robust tradeoffs by preserving base structure while retaining task gains. The work highlights the importance of weight-space geometry for generalization and points to SLERP as a promising tool for integrating specialization without sacrificing robustness, suggesting avenues for scaling and hybrid pretraining for non-linguistic reasoning.

Abstract

We investigate the generalization properties of dense text embeddings when the embedding backbone is a large language model (LLM) versus a non-LLM encoder, and we study the extent to which spherical linear interpolation (SLERP) model merging mitigates the over-specialization introduced by task-specific adaptation (e.g., LoRA). To make the comparison concrete and domain-agnostic, we design a controlled suite of experiments in which models embed short numerical sequences and are evaluated on their ability to cluster and classify those sequences according to well-defined number-theoretic properties. Our experimental protocol compares four families of models: (1) non-LLM encoders trained from scratch or fine-tuned for embeddings, (2) LLM-based encoders adapted with parameter-efficient methods (LoRA), (3) LLM-based encoders whose LoRA adapters are merged into the base weights via model souping, and (4) the same LoRA-adapted LLMs merged using SLERP across checkpoints or stages. We evaluate representational quality with clustering indices (Silhouette and Davies-Bouldin). We additionally analyze k-means cluster labels to check whether the embeddings encode information beyond the properties under test. Empirically, we find that LLM-based backbones produce embeddings that better capture higher-order, compositional numeric patterns, but are prone to adapter dominance that degrades balanced generalization; SLERP merging consistently recovers base-model structure while retaining most task gains, yielding better tradeoffs in clustering separability and robustness than model souping or unmerged models.
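To make the contrast between model souping and SLERP concrete, the following is a minimal sketch of both operations on flattened weight vectors. It is an illustration, not the paper's implementation: the function names `lerp` and `slerp`, the `eps` threshold, the fallback to linear interpolation for near-parallel vectors, and the choice to interpolate a single flat vector (rather than per layer) are all assumptions, and the interpolation factor `t` used by the authors is not stated in this section.

```python
import math


def lerp(w_a, w_b, t):
    """Linear interpolation of two weight vectors.

    With t = 0.5 this is a uniform two-model average (a simple "soup").
    """
    return [(1.0 - t) * a + t * b for a, b in zip(w_a, w_b)]


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors.

    Interpolates along the arc between the two directions instead of the
    straight chord, so the result does not shrink toward the origin when
    the endpoints point in different directions.
    """
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    if norm_a < eps or norm_b < eps:
        # Degenerate input: the angle is undefined (assumed fallback).
        return lerp(w_a, w_b, t)

    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or anti-parallel vectors (assumed fallback).
        return lerp(w_a, w_b, t)

    coef_a = math.sin((1.0 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


# Toy example: two orthogonal unit "weight vectors" (hypothetical values).
base = [1.0, 0.0]
adapted = [0.0, 1.0]
soup = lerp(base, adapted, 0.5)     # midpoint of the chord, norm below 1
merged = slerp(base, adapted, 0.5)  # midpoint of the arc, norm stays at 1
```

In this toy case the linear average has a smaller norm than either endpoint, while the SLERP midpoint keeps unit norm. This is one geometric reason SLERP can behave differently from plain averaging; whether it explains the paper's results is not something this sketch establishes.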

Paper Structure

This paper contains 22 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Silhouette scores for all models.
  • Figure 2: Davies-Bouldin scores for all models.
  • Figure 3: EmbeddingGemma Plot
  • Figure 4: GTE Multilingual Base Plot
  • Figure 5: GTE-large Plot
  • ...and 11 more figures