CNN2GNN: How to Bridge CNN with GNN

Ziheng Jiao, Hongyuan Zhang, Xuelong Li

TL;DR

The paper tackles the challenge of combining CNNs’ intra-sample feature extraction with GNNs’ ability to model topological relationships by introducing CNN2GNN, a framework that distills knowledge from a large CNN into a compact GNN. A differentiable sparse graph head is proposed to learn inductive graph structures for non-graph data, enabling efficient graph-based aggregation during training and inference. The method uses a heterogeneous distillation loss that blends cross-entropy with a temperature-scaled KL divergence to transfer the CNN’s deep intra-sample representations and the relational structure to the GNN. Empirical results on STL-10, CIFAR-100, and Mini-ImageNet show that the distilled GNN achieves superior accuracy and efficiency, sometimes surpassing the teacher and far outperforming CNN baselines with many layers.
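
The blended loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the mixing weight `alpha`, the `tau ** 2` scaling of the KL term (a common convention in response-based distillation), and all function names are assumptions made for illustration.

```python
import numpy as np


def softmax(logits, tau=1.0):
    """Row-wise softmax of logits divided by temperature tau."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.5):
    """Cross-entropy on hard labels blended with a temperature-scaled KL term.

    alpha (mixing weight) and the tau**2 factor are assumptions, not values
    taken from the paper.
    """
    eps = 1e-12
    rows = np.arange(len(labels))

    # Cross-entropy between student predictions and ground-truth labels.
    p_student = softmax(student_logits)
    ce = -np.mean(np.log(p_student[rows, labels] + eps))

    # KL(teacher || student) on temperature-softened distributions.
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = np.mean(np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=1))

    return (1.0 - alpha) * ce + alpha * tau ** 2 * kl
```

In this sketch `student_logits` would come from the GNN and `teacher_logits` from the CNN; a larger `tau` softens both distributions, so the student sees more of the teacher's relative class similarities.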

Abstract

Although the convolutional neural network (CNN) has achieved excellent performance in vision tasks by extracting intra-sample representations, it incurs a high training cost because it stacks numerous convolutional layers. Recently, graph neural networks (GNN), as bilinear models, have succeeded in exploring the underlying topological relationships among graph data with only a few graph neural layers. Unfortunately, GNNs cannot be applied directly to non-graph data because no graph structure is available, and they suffer from high inference latency in large-scale scenarios. Inspired by these complementary strengths and weaknesses, \textit{we discuss a natural question: how can these two heterogeneous networks be bridged?} In this paper, we propose a novel CNN2GNN framework that unifies CNN and GNN via distillation. First, to overcome the limitations of GNN, a differentiable sparse graph learning module is designed as the head of the network to dynamically learn the graph for inductive learning. Then, a response-based distillation is introduced to transfer knowledge from the CNN to the GNN and bridge these two heterogeneous networks. Notably, because it simultaneously extracts the intra-sample representation of a single instance and the topological relationships across the dataset, the distilled ``boosted'' two-layer GNN performs much better on Mini-ImageNet than CNNs containing dozens of layers, such as ResNet152.

Paper Structure (14 sections, 1 theorem, 7 equations, 6 figures, 3 tables, 2 algorithms)

Key Result

Theorem 1

Given a set of samples $\mathcal{V}=\left\{ \bm v_i | i=1,...,n \right\}$, the conditional probability $p(\bm v | \bm v_i)$ can be formulated from a problem in which $\bm \pi$ is a uniform distribution, ${\rm dist}(\cdot, \cdot)$ represents the $\ell_2$-norm distance, and $\gamma_i$ is the trade-off parameter. Eq. (graph_solve) is equivalent to a solution form of this problem.
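
The exact objective and Eq. (graph_solve) are not reproduced here, so the following is only a sketch of how a sparse, row-normalized graph could be built from $\ell_2$ distances. It assumes a widely used closed form for adaptive-neighbor graphs, in which each sample keeps at most $s$ nonzero neighbors and the trade-off parameter is absorbed into the $(s+1)$-th smallest distance. The function name `sparse_graph` and this particular closed form are assumptions, not the paper's stated solution.

```python
import numpy as np


def sparse_graph(features, s=3):
    """Build a row-stochastic graph with at most s nonzero neighbors per node.

    Assumed closed form (not taken from the paper):
        p_ij = max(d_{i,(s+1)} - d_ij, 0) / (s * d_{i,(s+1)} - sum of the s smallest d_ih)
    where d_{i,(s+1)} is the (s+1)-th smallest distance from node i.
    Requires at least s + 2 samples.
    """
    # Pairwise l2 distances between all samples.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbor

    ordered = np.sort(dist, axis=1)
    d_top = ordered[:, :s]   # the s smallest distances per row
    d_next = ordered[:, s]   # the (s+1)-th smallest distance per row

    denom = s * d_next - d_top.sum(axis=1) + 1e-12
    # Neighbors farther than d_next (and the diagonal) are clipped to zero.
    return np.maximum(d_next[:, None] - dist, 0.0) / denom[:, None]
```

Because the clipping is a `max` with zero, every operation stays (sub)differentiable with respect to `features`, which is the property a graph head needs to be trained end to end.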

Figures (6)

  • Figure 1: Merits of CNN and GNN. Figure \ref{CNN_frame}: a CNN can extract intra-sample representations, such as the trunk and contour of the animal in the image. Figure \ref{GNN_frame}: a GNN explores the relationships among the nodes in social graph data.
  • Figure 2: The framework of the proposed CNN2GNN model. Figure \ref{framework1} shows the training procedure. The CNN teacher and the GNN student learn the intra-sample representation and the latent topological relationship, respectively. $\mathcal{L}_{\rm student}$ bridges these two heterogeneous networks and transfers knowledge from the CNN to the GNN. Figure \ref{framework2} and Figure \ref{framework3} show inductive inference with Mechanism \ref{one_instance} and Mechanism \ref{batch_instance}, respectively. Figure \ref{framework2} cascades a test instance with a training batch for prediction. Figure \ref{framework3} selects the most similar samples in the training set to learn an approximate graph structure for evaluating the test samples batch by batch.
  • Figure 3: Visualization of the topological relationships learned by the differentiable sparse graph head on STL-10. The sparsity $s$ is $3$. The first and second rows show that the graph head accurately learns the relationships among instances of the same class. The bottom row suggests that instances of different classes are assigned a small similarity, or even $0$.
  • Figure 4: Accuracy of CNN2GNN w.r.t. the varying parameters $\tau \in \left\{ 2^{0}, 2^{2}, 2^{3}, 2^4, 2^5, 2^6 \right\}$ and $s \in \left\{ 10, 30, 50, 70, 90 \right\}$. Figure \ref{ResNet50-18-Cifar100} and Figure \ref{VGG13-8-Cifar100} show the results on CIFAR-100. Figure \ref{ResNet50-18-ImageNet} and Figure \ref{VGG13-8-ImageNet} show the results on Mini-ImageNet.
  • Figure 5: Performance comparison between our model and other models on CIFAR-100. Fewer FLOPs indicate a more efficient model; higher accuracy indicates better performance.
  • ...and 1 more figures

Theorems & Definitions (1)

  • Theorem 1