NTSFormer: A Self-Teaching Graph Transformer for Multimodal Isolated Cold-Start Node Classification
Jun Hu, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He
TL;DR
This work tackles multimodal isolated cold-start node classification, where test nodes have no edges and may be missing modalities. It introduces NTSFormer, a self-teaching Graph Transformer that handles isolation and missing modalities jointly: for each node it produces a student output from self information only and a teacher output that also uses neighbor context, and it is trained end-to-end without degrading to an MLP. A one-time multimodal graph pre-computation converts neighborhood information up to $K$ hops into token sequences, which are fused by a Mixture-of-Experts input projection before Transformer processing. Empirical results on Movies, Ele-fashion, and Goodreads-NC show that NTSFormer consistently outperforms baselines under the Text-Miss, Visual-Miss, No-Miss, and All settings, supporting the method’s effectiveness and scalability for multimodal graphs with isolated nodes. The approach reduces the train-test distribution gap and remains robust when modalities are missing, with potential extensions to dynamic graphs.
Abstract
Isolated cold-start node classification on multimodal graphs is challenging because such nodes have no edges and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to multilayer perceptrons (MLPs) for isolated cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a "student" prediction based only on self information (i.e., the node's own features), and a "teacher" prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer's capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experiments on public datasets show that NTSFormer achieves superior performance for multimodal isolated cold-start node classification.
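To make the cold-start attention mask more concrete, the following is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes a hypothetical token layout in which the first `num_self` tokens carry only the node's own features and the remaining `num_neighbor` tokens carry neighbor context. The function names, the single-head attention with identity projections, and the mean-pooled readouts are illustrative assumptions. Self tokens are blocked from attending to neighbor tokens, so a "student" representation read from them depends on self information only, while a "teacher" representation read from the unmasked positions can use both.

```python
import numpy as np


def cold_start_mask(num_self: int, num_neighbor: int) -> np.ndarray:
    """Return a boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumed layout (hypothetical, not taken from the paper): tokens
    [0, num_self) hold the node's own features; tokens
    [num_self, num_self + num_neighbor) hold neighbor context.
    """
    n = num_self + num_neighbor
    mask = np.ones((n, n), dtype=bool)
    # Block self-token queries from reading neighbor-token keys, so the
    # "student" positions depend only on self information.
    mask[:num_self, num_self:] = False
    return mask


def masked_self_attention(tokens: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head attention with identity projections (illustration only)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # disallowed pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


rng = np.random.default_rng(0)
num_self, num_neighbor, dim = 2, 3, 4  # e.g. text + visual self tokens, 3 neighbor tokens
tokens = rng.normal(size=(num_self + num_neighbor, dim))
mask = cold_start_mask(num_self, num_neighbor)

out = masked_self_attention(tokens, mask)
student_repr = out[:num_self].mean(axis=0)  # built from self tokens only
teacher_repr = out[num_self:].mean(axis=0)  # built from self + neighbor tokens

# An isolated node has no neighbor tokens at all; the student path is unaffected.
isolated_out = masked_self_attention(tokens[:num_self], mask[:num_self, :num_self])
print(np.allclose(out[:num_self], isolated_out))
```

Under these assumptions, the student rows computed with neighbors present match the rows computed for an isolated node, which illustrates how one model can supervise its self-only prediction with its neighbor-aware prediction without a train-test mismatch on the student path. In a full model the teacher prediction would typically serve as a distillation target for the student prediction; that loss is omitted here.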
