Combining LLM Semantic Reasoning with GNN Structural Modeling for Multi-View Multi-Label Feature Selection
Zhiqi Chen, Yuzhou Liu, Jiarui Liu, Wanfu Gao
TL;DR
This paper tackles multi-view multi-label feature selection by integrating semantic priors derived from Large Language Models with graph-based structural modeling. It introduces a semantic-aware two-level heterogeneous graph that fuses LLM-derived semantic edges with statistical edges computed from mutual information and label co-occurrence, and it uses a type-aware Graph Attention Network to learn feature embeddings and saliency scores. Empirical results on six benchmark datasets show consistent improvements over state-of-the-art baselines and demonstrate robustness on small-scale datasets, highlighting the value of combining semantic and statistical information. The approach offers practical gains for high-dimensional, multimodal data and opens avenues for end-to-end LLM feedback and self-supervised pretraining on heterogeneous graphs.
Abstract
Multi-view multi-label feature selection aims to identify informative features from heterogeneous views, where each sample is associated with multiple interdependent labels. This problem is particularly important in machine learning on high-dimensional, multimodal data, such as social media, bioinformatics, and recommendation systems. Existing Multi-View Multi-Label Feature Selection (MVMLFS) methods mainly analyze the statistical information in the data and seldom consider semantic information. In this paper, we use these two types of information jointly and propose a method that combines the semantic reasoning of Large Language Models (LLMs) with the structural modeling of Graph Neural Networks (GNNs) for MVMLFS. Specifically, the method consists of three main components. (1) An LLM is first used as an evaluation agent to assess the latent semantic relevance among feature, view, and label descriptions. (2) A semantic-aware heterogeneous graph with two levels is designed to represent relations among features, views, and labels: a semantic graph representing semantic relations and a statistical graph. (3) A lightweight Graph Attention Network (GAT) is applied to learn node embeddings in the heterogeneous graph, which are used as feature saliency scores for ranking and selection. Experimental results on multiple benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines. The method also remains effective on small-scale datasets, showcasing its robustness, flexibility, and generalization ability.
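The two-level graph described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the names `build_graph`, `rank_features`, `semantic_scores`, and `type_weight` are hypothetical; the semantic scores are hard-coded placeholders standing in for LLM outputs; view nodes are omitted for brevity; label co-occurrence is measured with a Jaccard ratio, which is one possible choice; and a type-weighted sum of edge weights stands in for the type-aware GAT, which would learn these weights instead.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def label_cooccurrence(Y, a, b):
    """Jaccard-style co-occurrence of labels a and b (an assumed measure)."""
    both = sum(1 for row in Y if row[a] and row[b])
    either = sum(1 for row in Y if row[a] or row[b])
    return both / either if either else 0.0

def build_graph(X, Y, semantic_scores):
    """Return typed edges (src, dst, edge_type, weight) for both graph levels."""
    n_feat, n_lab = len(X[0]), len(Y[0])
    edges = []
    # Statistical level: feature-label mutual information.
    for f in range(n_feat):
        col = [row[f] for row in X]
        for l in range(n_lab):
            w = mutual_information(col, [row[l] for row in Y])
            edges.append((f"f{f}", f"l{l}", "stat", w))
    # Statistical level: label-label co-occurrence.
    for a in range(n_lab):
        for b in range(a + 1, n_lab):
            edges.append((f"l{a}", f"l{b}", "stat", label_cooccurrence(Y, a, b)))
    # Semantic level: scores an LLM would assign (placeholder input here).
    for (src, dst), w in semantic_scores.items():
        edges.append((src, dst, "sem", w))
    return edges

def rank_features(edges, type_weight):
    """Stand-in for the GAT: type-weighted sum of feature-label edge weights."""
    score = Counter()
    for src, dst, etype, w in edges:
        if src.startswith("f") and dst.startswith("l"):
            score[src] += type_weight[etype] * w
    return [name for name, _ in score.most_common()]

# Toy data: 4 samples, 2 discrete features, 2 binary labels.
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
Y = [[0, 1], [0, 1], [1, 0], [1, 1]]
# Hypothetical semantic relevance scores, not real LLM outputs.
semantic_scores = {("f0", "l0"): 0.9, ("f1", "l0"): 0.2}

edges = build_graph(X, Y, semantic_scores)
ranking = rank_features(edges, {"stat": 1.0, "sem": 1.0})
print(ranking)
```

In this toy setting, feature `f0` duplicates label `l0`, so it receives both the largest statistical weight and the largest placeholder semantic weight. Keeping the edge type on every edge is what would allow a type-aware model to treat semantic and statistical evidence differently.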
