
Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval

Xintao Zong, Xian Zhong, Wenxuan Liu, Jianhao Ding, Zhaofei Yu, Tiejun Huang

Abstract

Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.

Paper Structure

This paper contains 45 sections, 27 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Brain-Inspired Multimodal SNN. Sensory-specific cortical regions process unimodal information, while higher-order areas such as the audiovisual (AV) cortex integrate multimodal semantics. This hierarchical mechanism inspires our spike-level fusion strategy for constructing a multimodal SNN framework.
  • Figure 2: Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval. (a) Overview of CMSF. (b) Working principle of the Spike Generator. (c,d) Details of Intra-modal Attention blocks. (e) One implementation of the Spike Fusion module. Pre-extracted region and word features are converted into spikes for unimodal semantic modeling within a sparse embedding space. The Spike Fusion module performs spike-level cross-modal interaction, generating enhanced embeddings as soft labels to guide unimodal encoders. This stage is excluded during inference. Bidirectional Hard-Alignment computes the fine-grained similarity matrix, while Early-Alignment and Soft-Label Alignment optimize unimodal encoders from both input and output perspectives. Spike Comb Cross Attention (SCCA) is detailed in §3.5, with additional fusion variants in Supplementary Material A.
  • Figure 3: Illustration of Our Bidirectional Hard Alignment. It identifies fine-grained region-word hard matches and integrates them to compute the overall image-text similarity.
  • Figure 4: Details of Our Spike Comb Cross Attention Structure. With the mask operation, it achieves lower time complexity and reduced energy consumption.
  • Figure 5: Illustration of Spike Fusion Soft-Label Alignment Strategy. The left part represents the similarity computation involved in both training and inference; the right part is training-only.
  • ...and 4 more figures
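The Figure 3 caption describes Bidirectional Hard-Alignment: each region keeps its best-matching word, each word keeps its best-matching region, and the two directions are combined into one image-text similarity. A minimal sketch of that idea, under assumptions not stated in the paper (cosine similarity on L2-normalized embeddings, mean-pooled hard matches, and a symmetric average of the two directions; the function name and signature are hypothetical):

```python
import numpy as np

def bidirectional_hard_alignment(regions: np.ndarray, words: np.ndarray) -> float:
    """Hypothetical sketch of a bidirectional hard-alignment score.

    regions: (R, D) L2-normalized region embeddings for one image.
    words:   (W, D) L2-normalized word embeddings for one caption.
    Returns a scalar image-text similarity in [-1, 1].
    """
    sim = regions @ words.T             # (R, W) region-word cosine similarities
    i2t = sim.max(axis=1).mean()        # image-to-text: best word per region
    t2i = sim.max(axis=0).mean()        # text-to-image: best region per word
    return 0.5 * (i2t + t2i)            # symmetric fusion of both directions
```

The hard `max` keeps only the single strongest cross-modal match per region or word, which is what makes the alignment "hard" as opposed to a softmax-weighted aggregation; the paper's actual pooling and weighting may differ.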