Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings

Shikun Liu, Haoyu Wang, Mufei Li, Pan Li

TL;DR

This work investigates whether embedding quality improves when structural relationships (e.g., hyperlinks, co-purchases, citations) are integrated directly into the LLM encoding process rather than applied via post-hoc aggregation. It introduces two structure-aware in-process strategies, Struc-Emb-Seq (sequential concatenation) and Struc-Emb-Par (parallel KV caching), along with Context Distillation and Semantic Balancing to combat noisy context. Zero-shot experiments across retrieval, clustering, classification, and recommendation show consistent gains over text-only and post-hoc baselines. The two methods trade off as context length and noise vary: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales better to long, high-signal contexts but is more susceptible to distractors. The findings offer a blueprint for building more contextually aware embeddings, with practical implications for applications whose data carries rich structural information.

Abstract

Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM's internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
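
To make the contrast between post-hoc aggregation and in-process sequential concatenation concrete, the following is a minimal sketch and not the authors' implementation. `toy_encode` is a hypothetical hashed bag-of-words stand-in for an LLM embedding call, `DIM` is an arbitrary toy dimension, and placing the related segments before the target in the concatenated input is an assumption. The sketch only shows where structure enters the pipeline; with a real LLM encoder, the concatenated input would let tokens of the target attend to the related segments, which this toy encoder cannot reproduce.

```python
import hashlib
import math

DIM = 64  # toy embedding dimension (assumption, not from the paper)


def _l2_normalize(vec):
    """Scale a vector to unit L2 norm; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def toy_encode(text):
    """Hypothetical stand-in for an LLM embedding call: hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # Deterministic hash so the example is reproducible across runs.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return _l2_normalize(vec)


def post_hoc_embed(target, related):
    """Baseline: encode every segment on its own, then average the vectors."""
    vecs = [toy_encode(target)] + [toy_encode(r) for r in related]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    return _l2_normalize(mean)


def sequential_embed(target, related):
    """In-process sketch: the encoder sees related segments and target together.

    Assumption: related segments are placed before the target text.
    """
    return toy_encode(" ".join(list(related) + [target]))


target = "graph neural networks"
related = ["citation networks", "node classification"]
post_hoc_vec = post_hoc_embed(target, related)
sequential_vec = sequential_embed(target, related)
```

In the post-hoc path the encoder never sees the related segments while encoding the target; in the sequential path it does, at the cost of a longer input.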

Paper Structure

This paper contains 24 sections, 3 equations, 10 figures, and 18 tables.

Figures (10)

  • Figure 1: Given a target and its related segments, individual encoding and post-hoc aggregation serve as baselines that separate embedding and structure aggregation. Structure-aware encoding instead injects structural relations during encoding via Struc-Emb-Seq and Struc-Emb-Par, further enhanced by semantic balancing and context distillation for robust structural information utilization.
  • Figure 2: Performance trend of Individual encoding, Struc-Emb-Seq, and Struc-Emb-Par as the text segment length of both target and related segments increases.
  • Figure 3: Computation time of different encoding methods on the MuSiQue dataset (selection using degree) with varying text length and number of related segments.
  • Figure 4: Performance variation in Struc-Emb-Seq and Struc-Emb-Par when permuting the order of related segments.
  • Figure 5: $\alpha$ sensitivity study on the MuSiQue datasets.
  • ...and 5 more figures