Improving vision-language alignment with graph spiking hybrid Networks

Siyu Zhang, Wenzhe Liu, Yeming Chen, Yiming Wu, Heming Zheng, Cheng Cheng

TL;DR

This work addresses vision-language alignment by enriching visual representations through panoptic segmentation and a Graph Spiking Hybrid Network (GSHN) that fuses continuous GAT-based encoding with discrete SNN-based encoding to capture local and global context. It introduces contrastive learning and a Spiked Text Learning (STL) pretraining task to align VL semantics, and uses a fused output $V' = r \cdot E_{SNN} + F_{GAT}$ with a learned weight ratio $r$. The approach achieves competitive results on VQA, VE, image-text retrieval, and NLVR$^2$, with notable gains in fine-grained semantic understanding and energy-efficient computation due to sparsity. The work demonstrates a scalable pathway for robust VL alignment by combining panoptic visual tokens, hybrid network architectures, and targeted pretraining tasks.
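The fusion rule $V' = r \cdot E_{SNN} + F_{GAT}$ can be illustrated with a minimal sketch. The function name `fuse`, the toy shapes, and the fixed value of `r` are assumptions for illustration only; in the paper $r$ is learned, and the actual feature dimensions are not given in this summary.

```python
from typing import List

Matrix = List[List[float]]


def fuse(e_snn: Matrix, f_gat: Matrix, r: float) -> Matrix:
    """Element-wise V' = r * E_SNN + F_GAT for two equally shaped matrices.

    e_snn: discrete (spike-like) features, one row per visual token.
    f_gat: continuous features from the graph-attention branch.
    r:     scalar weight ratio (learned in the paper, fixed here by assumption).
    """
    if len(e_snn) != len(f_gat):
        raise ValueError("E_SNN and F_GAT must have the same number of rows")
    fused: Matrix = []
    for s_row, g_row in zip(e_snn, f_gat):
        if len(s_row) != len(g_row):
            raise ValueError("E_SNN and F_GAT rows must have the same length")
        fused.append([r * s + g for s, g in zip(s_row, g_row)])
    return fused


# Toy example: 2 visual tokens with 3 feature dimensions each (hypothetical values).
e_snn = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]    # binary spike outputs
f_gat = [[0.2, -0.5, 0.1], [0.4, 0.3, -0.2]]  # continuous GAT features
v_prime = fuse(e_snn, f_gat, r=0.5)
```

Because the spike features are sparse, most entries of `v_prime` equal the GAT feature unchanged; only positions where a spike fired are shifted by `r`.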

Abstract

To bridge the semantic gap between vision and language (VL), a good alignment strategy is needed, one that handles semantic diversity, abstract representation of visual information, and model generalization. Recent works represent visual semantics with detector-based bounding boxes or regularly partitioned patches. While these paradigms have made strides, they remain insufficient for fully capturing the nuanced contextual relations among objects. This paper proposes a comprehensive visual semantic representation module that uses panoptic segmentation to generate coherent fine-grained semantic features. Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic information. The model encodes both the discrete and continuous latent variables of instances and captures both local and global contextual features, thereby enhancing the richness and diversity of semantic representations. Leveraging the spatiotemporal properties inherent in SNNs, we employ contrastive learning (CL) to enhance the similarity-based representation of embeddings. This strategy alleviates the computational overhead of the model and enriches meaningful visual representations by constructing positive and negative sample pairs. We design a pre-training method, Spiked Text Learning (STL), which uses text features to improve the encoding ability of discrete semantics. Experiments show that the proposed GSHN exhibits promising results on multiple VL downstream tasks.
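The idea of contrasting positive and negative sample pairs can be sketched as follows. The abstract does not state the exact loss, so this example assumes a standard InfoNCE-style objective over cosine similarities; the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss: low when the anchor is closer to the positive
    than to every negative, high otherwise."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical): the positive points in nearly the same
# direction as the anchor, the negatives do not.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.2]]

loss_good = info_nce(anchor, positive, negatives)
# Swapping in a dissimilar vector as the "positive" should raise the loss.
loss_bad = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

Minimizing such a loss pulls matched embeddings together and pushes mismatched ones apart, which is the sense in which contrastive learning enhances a similarity-based representation.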

Paper Structure

This paper contains 17 sections, 15 equations, 4 figures, 13 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the GSHN architecture. We use panoptic segmentation to optimize the fine-grained image semantic representations. We also adopt multi-layer Transformers to apply multimodal connections on pre-training tasks, where the pre-training task follows existing ITM, MLM, and CL works.
  • Figure 2: Information transmission diagram. We encode GAT-based concrete (solid green pathway) and SNN-based discrete (solid blue pathway) visual semantic modules.
  • Figure 3: Performance testing for time window sizes on the VQAv2 dataset.
  • Figure 4: Performance comparison of the proposed GSHN with and without semantic memory unit (SMU).