Token-Level Graphs for Short Text Classification
Gregor Donabauer, Udo Kruschwitz
TL;DR
This work tackles short text classification in low-resource settings, addressing two limitations of earlier graph-based methods: their transductive setup and their context-insensitive word representations. It introduces token-level graphs built with a PLM's tokenizer: each text sample forms its own graph whose nodes are tokens with context-derived embeddings $X=PLM(S_T)$, and each token is connected to the tokens within its $n$-hop neighborhood. Because every sample is a separate graph, learning is inductive. A two-layer Graph Attention Network with $d=128$ hidden units aggregates the node representations into a per-sample graph embedding, using fewer trainable parameters than PLM fine-tuning and making training more robust on small datasets. Experiments on Twitter, MR, Snippets, and TagMyNews show competitive or superior performance relative to strong baselines, particularly in low-resource or domain-specific contexts. The code is released for reproducibility.
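The per-sample graph construction described above can be illustrated with a minimal sketch. It assumes that the "$n$-hop neighborhood" means tokens at most $n$ positions apart in the token sequence and that edges are stored in both directions; the function name `build_token_graph` and the example tokens are hypothetical and not taken from the released code. The node features $X=PLM(S_T)$ would come from the PLM and are not computed here.

```python
from typing import List, Tuple


def build_token_graph(tokens: List[str], n_hops: int = 1) -> List[Tuple[int, int]]:
    """Return directed edge pairs (i, j) linking tokens at most n_hops apart.

    Assumption: "n-hop neighborhood" is read as distance in the token
    sequence; the released implementation may define it differently.
    """
    if n_hops < 1:
        raise ValueError("n_hops must be at least 1")
    last = len(tokens) - 1
    edges: List[Tuple[int, int]] = []
    for i in range(len(tokens)):
        # Look only to the right, then store both directions so that
        # the resulting edge list describes a symmetric graph.
        for j in range(i + 1, min(i + n_hops, last) + 1):
            edges.append((i, j))
            edges.append((j, i))
    return edges


# Hypothetical subword tokens, as a PLM tokenizer might produce them.
tokens = ["[CLS]", "short", "text", "class", "##ification", "[SEP]"]
edges = build_token_graph(tokens, n_hops=2)
```

Each text yields its own small edge list, so an unseen text can be turned into a graph at inference time without rebuilding a corpus-level graph, which is what makes the setup inductive.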
Abstract
The classification of short texts is a common subtask in Information Retrieval (IR). Recent advances in graph machine learning have led to interest in graph-based approaches, which show promise in low-resource scenarios. However, existing methods face limitations, such as not accounting for different meanings of the same word or the constraints of transductive approaches. We propose an approach that constructs text graphs entirely from tokens obtained through pre-trained language models (PLMs). By applying a PLM to tokenize and embed the texts when creating the graph nodes, our method captures contextual and semantic information, overcomes vocabulary constraints, and allows for context-dependent word meanings. Our approach also makes classification more efficient, with fewer parameters than classical PLM fine-tuning, resulting in more robust training with few samples. Experimental results demonstrate that our method consistently achieves higher scores than or on-par performance with existing methods, presenting an advancement in graph-based text classification techniques. To support reproducibility of our work, we make all implementations publicly available to the community (https://github.com/doGregor/TokenGraph).
