Masked Graph Transformer for Large-Scale Recommendation

Huiyuan Chen, Zhe Xu, Chin-Chia Michael Yeh, Vivian Lai, Yan Zheng, Minghua Xu, Hanghang Tong

TL;DR

This paper tackles the scalability challenge of Graph Transformers in large-scale recommendation by introducing MGFormer, a pure Transformer with linear-time masked kernel attention. It learns all-pair user/item interactions by treating nodes as independent tokens, enriched with SVD-based structural encodings, and uses a learnable sine-based degree mask to reweight attention. The approach uses Simplex Random Features for kernel approximation and DirectAU-inspired alignment/uniformity losses, achieving competitive Recall@20 and NDCG@20 on the Beauty, Yelp, and Alibaba datasets with a single attention layer. The findings indicate that pure Transformer architectures, when equipped with topology-aware masking and efficient kernel attention, can match or exceed GNN-based methods while keeping training scalable on large graphs.
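To make the "alignment/uniformity" objective more concrete, the following is a minimal NumPy sketch of the two terms as they are commonly defined in the DirectAU line of work: alignment pulls the normalized embeddings of observed user/item pairs together, and uniformity spreads embeddings over the unit hypersphere. The summary above only says the losses are DirectAU-inspired, so the exact form and weighting used in MGFormer are not shown here. The function names and the temperature `t=2.0` are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def alignment_loss(user_emb, item_emb):
    """Mean squared distance between normalized embeddings of positive pairs.

    Row k of user_emb and row k of item_emb are assumed to form one
    observed (user, item) interaction.
    """
    u = l2_normalize(user_emb)
    i = l2_normalize(item_emb)
    return np.mean(np.sum((u - i) ** 2, axis=1))


def uniformity_loss(emb, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of rows."""
    x = l2_normalize(emb)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dist[upper])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    users = rng.normal(size=(8, 16))
    items = rng.normal(size=(8, 16))
    # Hypothetical equal weighting of the terms; the real trade-off is a hyperparameter.
    total = alignment_loss(users, items) + 0.5 * (
        uniformity_loss(users) + uniformity_loss(items)
    )
    print(float(total))
```

Lower alignment means matched pairs sit closer together. Lower (more negative) uniformity means the embeddings are more evenly spread, which discourages the collapse that alignment alone would cause.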

Abstract

Graph Transformers have garnered significant attention for learning graph-structured data, thanks to their superb ability to capture long-range dependencies among nodes. However, their quadratic space and time complexity hinders scalability, particularly for large-scale recommendation. Here we propose an efficient Masked Graph Transformer, named MGFormer, capable of capturing all-pair interactions among nodes with linear complexity. To achieve this, we treat all user/item nodes as independent tokens, enhance them with positional embeddings, and feed them into a kernelized attention module. Additionally, we incorporate learnable relative degree information to appropriately reweight the attention scores. Experimental results show the superior performance of MGFormer, even with a single attention layer.
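The step from quadratic to linear complexity in kernelized attention rests on reordering the matrix products: once the softmax kernel is replaced by a feature map φ, the output φ(Q)(φ(K)ᵀV) can be computed without ever forming the n × n attention matrix. The sketch below illustrates only that reordering. It is not MGFormer's module: the feature map `elu(x) + 1` is a simple positive stand-in for the paper's Simplex Random Features, the degree-based mask is omitted, and all function names are hypothetical.

```python
import numpy as np


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed stand-in, not Simplex Random Features)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_kernel_attention(Q, K, V):
    """Kernel attention in O(n * d * d_v): never builds the n x n matrix."""
    phi_q = feature_map(Q)           # (n, d)
    phi_k = feature_map(K)           # (n, d)
    kv = phi_k.T @ V                 # (d, d_v) summary of all keys and values
    k_sum = phi_k.sum(axis=0)        # (d,) used for row normalization
    numerator = phi_q @ kv           # (n, d_v)
    denominator = phi_q @ k_sum      # (n,)
    return numerator / denominator[:, None]


def quadratic_kernel_attention(Q, K, V):
    """Reference version in O(n^2 * d): builds the full attention matrix."""
    scores = feature_map(Q) @ feature_map(K).T            # (n, n)
    weights = scores / scores.sum(axis=1, keepdims=True)  # rows sum to 1
    return weights @ V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 6, 4, 3              # n tokens = user/item nodes
    Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    V = rng.normal(size=(n, d_v))
    fast = linear_kernel_attention(Q, K, V)
    slow = quadratic_kernel_attention(Q, K, V)
    print(np.allclose(fast, slow))
```

Both functions compute the same values. The linear version scales with the number of nodes n instead of n², which is what makes all-pair attention feasible on large user/item graphs.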

Paper Structure (15 sections, 15 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overview of the proposed MGFormer.
  • Figure 2: Performance for different item groups.