CNN2GNN: How to Bridge CNN with GNN
Ziheng Jiao, Hongyuan Zhang, Xuelong Li
TL;DR
The paper tackles the challenge of combining CNNs’ intra-sample feature extraction with GNNs’ ability to model topological relationships by introducing CNN2GNN, a framework that distills knowledge from a large CNN into a compact GNN. A differentiable sparse graph head is proposed to learn inductive graph structures for non-graph data, enabling efficient graph-based aggregation during training and inference. The method uses a heterogeneous distillation loss that blends cross-entropy with a temperature-scaled KL divergence to transfer the CNN’s deep intra-sample representations and the relational structure to the GNN. Empirical results on STL-10, CIFAR-100, and Mini-ImageNet show that the distilled GNN achieves superior accuracy and efficiency, sometimes surpassing the teacher and far outperforming CNN baselines with many layers.
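As a rough illustration of the loss described above, the following minimal sketch blends cross-entropy on the ground-truth labels with a temperature-scaled KL divergence between teacher (CNN) and student (GNN) logits. The function name `distillation_loss`, the mixing weight `alpha`, the default `temperature`, and the `temperature**2` rescaling are assumptions borrowed from the common knowledge-distillation convention; the exact weighting used by CNN2GNN is not given in this summary.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend of hard-label cross-entropy and soft-label KL divergence.

    alpha and the temperature**2 factor are assumed conventions,
    not values taken from the paper.
    """
    n = student_logits.shape[0]

    # Cross-entropy between student predictions and ground-truth labels.
    log_p_student = log_softmax(student_logits)
    ce = -log_p_student[np.arange(n), labels].mean()

    # KL(teacher || student) on temperature-softened distributions.
    log_q_teacher = log_softmax(teacher_logits / temperature)
    log_q_student = log_softmax(student_logits / temperature)
    kl = (np.exp(log_q_teacher) * (log_q_teacher - log_q_student)).sum(axis=1).mean()

    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the cross-entropy term remains, which is a quick sanity check for any implementation of this kind.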
Abstract
Although the convolutional neural network (CNN) has achieved excellent performance in vision tasks by extracting intra-sample representations, it incurs a high training cost because it stacks numerous convolutional layers. Recently, graph neural networks (GNNs), as bilinear models, have succeeded in exploring the underlying topological relationships among graph data with only a few graph neural layers. Unfortunately, a GNN cannot be applied directly to non-graph data because no graph structure is available, and it has high inference latency in large-scale scenarios. Inspired by these complementary strengths and weaknesses, we discuss a natural question: how can these two heterogeneous networks be bridged? In this paper, we propose a novel CNN2GNN framework that unifies CNN and GNN via distillation. First, to overcome the limitations of GNNs, a differentiable sparse graph learning module is designed as the head of the network to dynamically learn the graph for inductive learning. Then, response-based distillation is introduced to transfer knowledge from the CNN to the GNN and bridge the two heterogeneous networks. Notably, because it captures both the intra-sample representation of a single instance and the topological relationships across the dataset, the distilled "boosted" two-layer GNN performs much better on Mini-ImageNet than CNNs containing dozens of layers, such as ResNet152.
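To make the idea of a graph head followed by a shallow GNN more concrete, here is a minimal forward-pass sketch. It is not the paper's module: the top-k cosine-similarity graph, the softmax weighting over retained neighbours, the self-loop averaging, and the names `sparse_graph` and `two_layer_gnn` are all illustrative assumptions. NumPy is used only to show the computation; in an autodiff framework the softmax weights over the retained neighbours would stay differentiable with respect to the features.

```python
import numpy as np

def sparse_graph(features, k=2, eps=1e-12):
    """Row-stochastic graph keeping k neighbours per sample (hypothetical head)."""
    # Cosine similarity between every pair of samples in the batch.
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbour

    # Keep the k most similar samples in each row, mask out the rest.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(sim, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    scores = np.where(mask, sim, -np.inf)

    # Softmax over the retained entries so each row sums to one.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def two_layer_gnn(features, adjacency, w1, w2):
    """Two graph-convolution layers with a ReLU in between."""
    # Average in a self-loop so each node keeps part of its own feature.
    a_hat = 0.5 * (adjacency + np.eye(adjacency.shape[0]))
    hidden = np.maximum(a_hat @ features @ w1, 0.0)
    return a_hat @ hidden @ w2  # class logits, one row per sample
```

Because the graph is built from the features of whatever batch is supplied, the same head can be applied to unseen samples at inference time, which is the sense in which such a head supports inductive learning.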
