Towards Improved Sentence Representations using Token Graphs

Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Zorah Lähner, Moshe Eliasof

TL;DR

GLOT is a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation, and shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs.

Abstract

Obtaining a single-vector representation from a Large Language Model's (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state-of-the-art techniques on benchmarks such as GLUE and MTEB while using 20x fewer trainable parameters, and it trains over 100x faster than parameter-efficient fine-tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs. Our code is published at https://github.com/ipsitmantri/GLOT.
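
The three stages named in the abstract (graph construction, graph-based refinement, readout) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the graph is built, which GNN is used, or the form of the readout, so the cosine-similarity threshold, the single mean-aggregation layer, and the softmax-weighted readout below are all assumptions, and every function and parameter name is hypothetical. For the actual method, see the published code.

```python
import numpy as np


def build_token_graph(h, threshold=0.5):
    """Assumed construction: connect tokens whose cosine similarity exceeds a threshold."""
    norms = np.clip(np.linalg.norm(h, axis=1, keepdims=True), 1e-12, None)
    unit = h / norms
    sim = unit @ unit.T                      # (n_tokens, n_tokens) cosine similarities
    adj = (sim > threshold).astype(h.dtype)
    np.fill_diagonal(adj, 1.0)               # self-loops keep every degree >= 1
    return adj


def gnn_layer(h, adj, w):
    """Assumed refinement: one mean-aggregation message-passing step, then ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.maximum((adj / deg) @ h @ w, 0.0)


def readout(h, q):
    """Assumed readout: softmax token weights from scoring vector q, then a weighted sum."""
    scores = h @ q
    scores = scores - scores.max()           # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ h, weights


def graph_pool(h, w, q, threshold=0.5):
    """Graph construction -> relational refinement -> aggregation."""
    adj = build_token_graph(h, threshold)
    refined = gnn_layer(h, adj, w)
    return readout(refined, q)


# Random stand-ins for frozen-LLM hidden states (6 tokens, 8 dimensions)
# and for the small set of trainable parameters (w, q).
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))
w = 0.1 * rng.normal(size=(8, 8))
q = rng.normal(size=8)
z, token_weights = graph_pool(h, w, q)
```

In this sketch only `w` and `q` would be trained, while `h` comes from the frozen backbone, which mirrors the abstract's point that the pooling module is lightweight relative to the model it sits on.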

Paper Structure

This paper contains 46 sections, 2 equations, 5 figures, 17 tables, 2 algorithms.

Figures (5)

  • Figure 1: Fine-tuning large language models for sentence embeddings is computationally expensive. Our pooling method, GLOT, constructs a latent token-similarity graph from the outputs of a frozen model. It then refines token representations with a graph neural network before aggregation. This technique enables decoder-only models (like Mistral-7B), typically optimized for next-token prediction, to produce powerful sentence-level representations without requiring any fine-tuning.
  • Figure 2: An overview of the GLOT pooling architecture. Given token hidden states from a frozen language model, our trainable module performs three stages: (1) it constructs a latent token-similarity graph, (2) a TOKEN-GNN performs relational learning to refine token representations, and (3) a readout layer aggregates the refined vectors into a final sentence representation, ${\mathbf{z}}$.
  • Figure 3: Robustness to signal dilution on the diagnostic stress test. Each of the four panels displays the classification accuracy for all pooling methods at a specific distractor ratio, which increases from 20% to 90%. Within each panel, backbone models are arranged along the x-axis by their parameter count.
  • Figure 4: Z-score normalized performance on the GLUE benchmark, aggregated by task category. Performance, represented as a z-score, is plotted against the number of parameters in the frozen backbone model (log scale). A higher z-score indicates better relative performance compared to the average of all tested methods for that setting.
  • Figure 5: Token Contribution Analysis with frozen BERT. Visualization of learned token weights ($\pi$) on two examples. The orange highlights on the x-axis indicate the top-3 scoring tokens identified by GLOT. While Mean Pooling (green) is uniform and AdaPool (grey) tends to over-index on high-frequency function words, GLOT (blue) consistently up-weights the semantic anchors essential for determining sentence equivalence.
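
The z-score normalization described for Figure 4 can be sketched as follows. The caption says only that scores are compared to the average of all tested methods for a given setting; the use of the population standard deviation, the function name `zscores`, and the example accuracies are assumptions made for illustration and are not results from the paper.

```python
from statistics import mean, pstdev


def zscores(scores):
    """Map {method: raw score} to {method: z-score} within a single setting."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())  # population std; the paper's choice is not stated here
    if sigma == 0:
        # All methods tie, so none is above or below average.
        return {method: 0.0 for method in scores}
    return {method: (s - mu) / sigma for method, s in scores.items()}


# Hypothetical accuracies for three pooling methods on one backbone and task category.
example = {"mean_pool": 0.70, "max_pool": 0.60, "graph_pool": 0.80}
z = zscores(example)
```

A positive value means the method scored above the average of the methods compared in that setting, and a negative value means it scored below.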