GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs

Liangwei Yang, Jing Ma, Jianguo Zhang, Zhiwei Liu, Jielin Qiu, Shirley Kokane, Shiyu Wang, Haolin Chen, Rithesh Murthy, Ming Zhu, Huan Wang, Weiran Yao, Caiming Xiong, Shelby Heinecke

TL;DR

This work identifies semantic drift as a key limitation when applying traditional linear neighborhood aggregation to PLM-derived embeddings in text-attributed graphs. It introduces a local PCA-based metric to quantify drift and proposes Geodesic Aggregation, which operates along geodesics on the semantic manifold via log–exp maps on the unit sphere, forming the GeoGNN model with spherical attention. Across four CS-TAG datasets and diverse encoders, GeoGNN consistently outperforms strong baselines and exhibits reduced manifold distortion, validating the value of geometry-aware message passing. The findings highlight the importance of respecting the intrinsic geometry of textual representations for robust and scalable text–graph learning.

Abstract

Graph neural networks (GNNs) on text-attributed graphs (TAGs) typically encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. However, the representation spaces of modern PLMs are highly non-linear and geometrically structured, where textual embeddings reside on curved semantic manifolds rather than flat Euclidean spaces. Linear aggregation on such manifolds inevitably distorts geometry and causes semantic drift, a phenomenon where aggregated representations deviate from the intrinsic manifold, losing semantic fidelity and expressive power. To quantitatively investigate this problem, this work introduces a local PCA-based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure. Building upon these insights, we propose Geodesic Aggregation, a manifold-aware mechanism that aggregates neighbor information along geodesics via log-exp mappings on the unit sphere, ensuring that representations remain faithful to the semantic manifold during message passing. We further develop GeoGNN, a practical instantiation that integrates spherical attention with manifold interpolation. Extensive experiments across four benchmark datasets and multiple text encoders show that GeoGNN substantially mitigates semantic drift and consistently outperforms strong baselines, establishing the importance of manifold-aware aggregation in text-attributed graph learning.
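
The abstract does not spell out the local PCA-based drift metric, so the following is only a minimal sketch of one plausible reading: fit PCA on the k nearest original embeddings around an aggregated vector and report how far that vector lies from the fitted local subspace. The function name `local_pca_drift`, the parameters `k` and `dim`, and the use of the residual norm as the score are all assumptions made for illustration, not the paper's definition.

```python
import numpy as np


def local_pca_drift(points, query, k=10, dim=2):
    """Distance from `query` to a local PCA subspace (hypothetical drift score).

    points: (n, d) array of original embeddings that sample the manifold.
    query:  (d,) aggregated embedding whose drift we want to estimate.
    k, dim: neighborhood size and assumed local manifold dimension.
    """
    # Select the k reference embeddings closest to the query.
    dists = np.linalg.norm(points - query, axis=1)
    nbrs = points[np.argsort(dists)[:k]]

    # Local PCA: top `dim` right singular vectors of the centered neighborhood.
    center = nbrs.mean(axis=0)
    _, _, vt = np.linalg.svd(nbrs - center, full_matrices=False)
    basis = vt[:dim]  # (dim, d), orthonormal rows

    # Remove the part of the offset explained by the local subspace;
    # what remains is the off-manifold component.
    offset = query - center
    residual = offset - basis.T @ (basis @ offset)
    return float(np.linalg.norm(residual))
```

Under this reading, a vector that stays within the locally fitted subspace scores close to zero, while a linear average that cuts across a curved region should score higher. Comparing the scores of different aggregators on the same neighborhoods would then give a relative measure of drift.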

Paper Structure

This paper contains 22 sections, 16 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of Semantic Drift: linear averaging ($n_4$) deviates from the semantic manifold, while geodesic averaging ($n_3$) stays on the manifold.
  • Figure 2: Comparison of four aggregators (Mean, Laplacian, Attention, Geodesic) on the Photo and History datasets. Our Geodesic Aggregator preserves the manifold structure and mitigates semantic drift.
  • Figure 3: Quantifying semantic drift over aggregators.
  • Figure 4: Overall framework of GeoGNN. Node texts are encoded by a frozen pretrained language model (PLM) and projected onto a spherical manifold through linear projection and normalization. GeoGNN then performs geometry-preserving message passing by (a) mapping neighbor embeddings to the tangent space (log map), (b) aggregating them via geodesic attention, and (c) projecting results back to the manifold (exp map). This design preserves representation manifold fidelity.
  • Figure 5: Comparison of GNNs across different encoders.
  • ...and 2 more figures
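
The three steps in the Figure 4 caption (log map, attention-weighted aggregation, exp map) can be sketched on the unit sphere as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the attention weights here are a plain softmax over cosine similarities, and the names `log_map`, `exp_map`, `geodesic_aggregate`, and `temperature` are hypothetical. The learned projection and the exact form of spherical attention are not given in this summary.

```python
import numpy as np

EPS = 1e-8


def normalize(x):
    """Project vectors onto the unit sphere (rows if x is 2-D)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_map(p, q):
    """Tangent vectors at unit vector p pointing toward each unit row of q."""
    cos = np.clip(q @ p, -1.0, 1.0)
    theta = np.arccos(cos)                      # geodesic distance to each row
    perp = q - cos[:, None] * p                 # component orthogonal to p
    norm = np.linalg.norm(perp, axis=1, keepdims=True)
    return theta[:, None] * perp / np.maximum(norm, EPS)


def exp_map(p, v):
    """Follow the geodesic from p in direction v; the result has unit norm."""
    n = np.linalg.norm(v)
    if n < EPS:
        return p
    return np.cos(n) * p + np.sin(n) * v / n


def geodesic_aggregate(center, neighbors, temperature=1.0):
    """(a) log map, (b) attention-weighted mean in tangent space, (c) exp map."""
    # Assumed attention: softmax over cosine similarity to the center node.
    scores = neighbors @ center / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    tangent = weights @ log_map(center, neighbors)
    return exp_map(center, tangent)


# Usage: random features stand in for projected PLM embeddings.
rng = np.random.default_rng(0)
feats = normalize(rng.normal(size=(5, 16)))
out = geodesic_aggregate(feats[0], feats[1:])
print(np.linalg.norm(out), np.linalg.norm(feats[1:].mean(axis=0)))
```

The printed pair contrasts the norm of the geodesic result, which stays on the sphere by construction, with the norm of a plain linear mean of the same neighbors, which generally falls inside the sphere. Note that the log map is degenerate for exactly antipodal points; this sketch returns a zero tangent vector in that case.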

Theorems & Definitions (1)

  • Definition 1: Semantic Drift