NTSFormer: A Self-Teaching Graph Transformer for Multimodal Isolated Cold-Start Node Classification

Jun Hu, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He

TL;DR

This work tackles multimodal isolated cold-start node classification, where test nodes have no edges and may be missing modalities. It introduces NTSFormer, a self-teaching Graph Transformer that handles isolation and missing modalities jointly: the model produces a student output from a node's own information and a teacher output that also uses neighbor context, and it is trained end-to-end without degrading to an MLP. A one-time multimodal graph pre-computation converts neighborhood information up to $K$ hops into token sequences, which are fused by a Mixture-of-Experts input projection before Transformer processing. Experiments on Movies, Ele-fashion, and Goodreads-NC show that NTSFormer consistently outperforms baselines under the Text-Miss, Visual-Miss, No-Miss, and All settings, supporting the method’s effectiveness and scalability on multimodal graphs with isolated nodes. The approach narrows the train-test distribution gap and remains robust when modalities are missing, with potential extensions to dynamic graphs.
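
The pre-computation step can be pictured with a small sketch. The summary above does not say which aggregator NTSFormer uses, so the mean aggregator, the zero vector for nodes without neighbors, and the function name `khop_token_sequences` are all assumptions made for illustration; this is not the authors' implementation. The sketch turns one modality's features into a sequence of $K + 1$ tokens per node (the node's own features, then one token per hop).

```python
def khop_token_sequences(features, neighbors, num_hops):
    """Return, for every node, a list of num_hops + 1 token vectors.

    Token 0 is the node's own feature vector. Token k is the mean of the
    neighbors' token k - 1 (mean aggregation is an assumption, not taken
    from the paper).
    """
    num_nodes = len(features)
    dim = len(features[0])
    current = [list(vec) for vec in features]
    sequences = [[list(vec)] for vec in features]
    for _ in range(num_hops):
        next_level = []
        for node in range(num_nodes):
            nbrs = neighbors[node]
            if nbrs:
                agg = [sum(current[n][d] for n in nbrs) / len(nbrs)
                       for d in range(dim)]
            else:
                # Isolated node: no neighbor signal, so use a zero token (assumed).
                agg = [0.0] * dim
            next_level.append(agg)
            sequences[node].append(agg)
        current = next_level
    return sequences


# Toy graph: node 2 is isolated, like a cold-start test node.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1], [0, 2], []]
tokens = khop_token_sequences(features, neighbors, num_hops=2)
```

Because the graph is only touched here, the step runs once before training; in a multimodal setting one would call it separately for the text and visual features and concatenate the resulting token sequences.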

Abstract

Isolated cold-start node classification on multimodal graphs is challenging because such nodes have no edges and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to multilayer perceptrons (MLPs) for isolated cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a "student" prediction based only on self information (i.e., the node's own features), and a "teacher" prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer's capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experiments on public datasets show that NTSFormer achieves superior performance for multimodal isolated cold-start node classification.
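
A minimal sketch of how a cold-start attention mask could yield both predictions in one forward pass is given below. The token layout (a student classification token, self tokens, a teacher classification token, then neighbor tokens) and the rule that self-side tokens never attend to neighbor-side tokens are assumptions chosen for illustration; the abstract does not specify them. The names `cold_start_mask` and `masked_softmax` are hypothetical.

```python
import math


def cold_start_mask(num_self, num_neighbor):
    """Boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: [student_cls] + self tokens + [teacher_cls] + neighbor tokens.
    """
    size = 1 + num_self + 1 + num_neighbor
    self_side = set(range(1 + num_self))  # student_cls and the self tokens
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        for k in range(size):
            if q in self_side:
                # Student path: only self information is visible.
                mask[q][k] = k in self_side
            else:
                # Teacher path: self and neighbor information are both visible.
                mask[q][k] = True
    return mask


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = cold_start_mask(num_self=2, num_neighbor=3)
student_weights = masked_softmax([0.0] * 7, mask[0])  # row of student_cls
teacher_weights = masked_softmax([0.0] * 7, mask[3])  # row of teacher_cls
```

Under these assumptions the student row places zero weight on the teacher and neighbor positions, so its output is unchanged when an isolated test node has no neighbor tokens at all, while the teacher row can still supply a training signal on nodes that do have neighbors.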

Paper Structure

This paper contains 26 sections, 17 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The multimodal isolated cold‑start node classification task focuses on classifying isolated cold‑start nodes that have no edges and may be missing certain modalities.
  • Figure 2: Performance on multimodal isolated cold-start node classification. General GNNs (GraphSAGE) and multimodal GNNs (MMGCN, MGAT) even underperform MLPs.
  • Figure 3: NTSFormer uses a cold-start mask to make two predictions: a "student" prediction based on self-features only, and a "teacher" prediction with both self and neighbor information, enabling it to supervise itself without degrading to an MLP, thereby leveraging Transformers' capacity.
  • Figure 4: Overall framework of NTSFormer.
  • Figure 5: Ablation results of NTSFormer and its ablated variants (w/o MMPre, w/o MoE, w/o SelfTeach). The first three subfigures report performance on different subsets of test nodes, each with a specific modality-missing setting: (a) Text-Miss, (b) Visual-Miss, and (c) No-Miss. The last subfigure (d) reports performance on all test nodes.
  • ...and 1 more figure