Memory-Inspired Temporal Prompt Interaction for Text-Image Classification

Xinyao Yu, Hao Sun, Ziwei Niu, Rui Qin, Zhenjia Bai, Yen-Wei Chen, Lanfen Lin

TL;DR

MITP introduces memory-inspired temporal prompts to enable efficient text-image interaction on intermediate layers of a frozen foundation model, using a memory hub to consolidate and activate cross-modal information with a small set of trainable parameters (~2.0M). The approach achieves competitive accuracy on UPMC-Food101, MM-IMDB, and SNLI-VE, outperforming many prompt-based methods while maintaining low memory usage and parameter counts. By combining temporal prompts with similarity-based prompt generation, MITP facilitates two-way modality exchange without fine-tuning the backbone, offering a practical solution for efficient multimodal transfer learning. The work highlights a promising direction for memory-inspired, prompt-based cross-modal interaction in image-text classification and beyond.

Abstract

In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in various natural language processing and computer vision tasks. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models on downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel prompt-based multimodal interaction strategy inspired by human memory, namely Memory-Inspired Temporal Prompt Interaction (MITP). Our proposed method involves two stages, as in human memory: the acquiring stage, and the consolidation and activation stage. We utilize temporal prompts on intermediate layers to imitate the acquiring stage, leverage similarity-based prompt interaction to imitate memory consolidation, and employ a prompt generation strategy to imitate memory activation. The main strength of our method is that the prompt vectors interact on intermediate layers, enabling sufficient information exchange between modalities with few trainable parameters and low memory usage. We achieve competitive results on several datasets with relatively small memory usage and 2.0M trainable parameters (about 1% of the pre-trained foundation model).
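
The abstract names similarity-based prompt interaction but gives no equations, so the following is only a minimal sketch of one way such an exchange could look. The function names (`cosine`, `softmax`, `blend_from_other`, `memory_hub`), the use of cosine similarity with a softmax, and the residual addition are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_from_other(own_prompts, other_prompts):
    """Hypothetical consolidation step: each prompt receives a
    similarity-weighted mix of the other modality's prompts."""
    blended = []
    for p in own_prompts:
        weights = softmax([cosine(p, q) for q in other_prompts])
        mixed = [
            sum(w * q[i] for w, q in zip(weights, other_prompts))
            for i in range(len(p))
        ]
        # Residual addition keeps the prompt's own information (an assumption).
        blended.append([a + b for a, b in zip(p, mixed)])
    return blended


def memory_hub(text_prompts, image_prompts):
    """Hypothetical two-way exchange: returns next-layer prompts
    for the text branch and the image branch."""
    next_text = blend_from_other(text_prompts, image_prompts)
    next_image = blend_from_other(image_prompts, text_prompts)
    return next_text, next_image


# Toy usage with 2-dimensional prompts.
text_next, image_next = memory_hub([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(text_next, image_next)
```

In this sketch the exchange is symmetric, which mirrors the two-way modality exchange described in the TL;DR; a real implementation would presumably use learned projections rather than raw prompt vectors.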

Paper Structure (20 sections, 9 equations, 13 figures, 3 tables)


Figures (13)

  • Figure 2: The pipeline of our proposed method. We use a frozen pre-trained foundation model for basic feature extraction in the image and text branches, and place temporal prompts on intermediate layers to store information at each of those layers and act as a medium for information exchange. Temporal prompts from the different modalities are blended in the memory hub to generate prompts for the next layer. Only the prompts and the memory hub are trainable and require backward propagation; the pre-trained foundation model is frozen and does not participate in backward propagation.
  • Figure : (a) The first proposed prompt-based interaction strategy [liang2022modular].
  • Figure : (a) Comparisons on overall efficiency.
  • Figure : (a) The results on interaction layers with interval=1.
  • Figure : (a) The first proposed prompt-based interaction strategy [liang2022modular].
  • ...and 8 more figures
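
The Figure 2 caption states that only the prompts and the memory hub are trainable, and the abstract puts the trainable budget at 2.0M parameters, about 1% of the foundation model. The sketch below shows how such a ratio could be tallied. The `ParamGroup` type, the group names, and the 200M backbone size are hypothetical: the backbone size is chosen only to be consistent with the stated 1% and is not reported in the text above.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    count: int        # number of parameters in the group
    trainable: bool   # False means frozen: no gradient updates


def trainable_ratio(groups):
    """Return (trainable, frozen, trainable / frozen) parameter counts."""
    trainable = sum(g.count for g in groups if g.trainable)
    frozen = sum(g.count for g in groups if not g.trainable)
    return trainable, frozen, trainable / frozen


# Hypothetical sizes: 200M is an illustrative backbone size, not a reported figure.
groups = [
    ParamGroup("frozen_foundation_model", 200_000_000, trainable=False),
    ParamGroup("temporal_prompts_and_memory_hub", 2_000_000, trainable=True),
]

trainable, frozen, ratio = trainable_ratio(groups)
print(f"trainable={trainable:,} frozen={frozen:,} ratio={ratio:.2%}")
```

The ratio is computed against the frozen model alone, matching the abstract's phrasing "of the pre-trained foundation model" rather than the combined total.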