Buffer replay enhances the robustness of multimodal learning under missing-modality

Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang

TL;DR

This work tackles the challenge of missing modalities in multimodal learning by introducing REP, a lightweight framework that caches early-layer representations in private and shared buffers, replays them across deeper layers, and uses dynamic initialization to enhance generalization. Through residual bypass updates and orthogonality constraints, REP preserves modality-specific signals while maintaining cross-modal semantics, enabling robust performance when information from some modalities is unavailable. Extensive experiments across vision-language, vision-language-audio, and temporal multimodal tasks demonstrate that REP consistently outperforms prior prompt-based and generative approaches, often with negligible parameter overhead. The results establish REP as a practical, robust approach for real-world multimodal systems facing incomplete modality availability.

Abstract

Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
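The caching and replay idea in point (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy `make_layer` blocks, the mean-pooled residual update, the `alpha` step size, and replay by prepending buffer tokens are all assumptions made for illustration, and the private/shared split and dynamic initialization are omitted.

```python
import numpy as np

def make_layer(rng, dim):
    """Toy stand-in for a frozen encoder block: a token-wise residual map."""
    w = rng.normal(scale=0.1, size=(dim, dim))
    return lambda x: x + np.tanh(x @ w)

def forward_with_replay(tokens, layers, buffer, replay_at, alpha=0.5):
    """Cache features from layers before `replay_at`, then replay them.

    tokens: (n_tokens, dim), buffer: (n_buf, dim), replay_at: 0-based layer index.
    """
    x = tokens
    for i, layer in enumerate(layers):
        if i == replay_at:
            # Replay: prepend the cached buffer so deeper layers see early features.
            x = np.concatenate([buffer, x], axis=0)
        x = layer(x)
        if i < replay_at:
            # Residual bypass (assumed form): add a pooled summary of this layer.
            buffer = buffer + alpha * x.mean(axis=0, keepdims=True)
    return x, buffer

rng = np.random.default_rng(0)
dim, n_tokens, n_buf, depth, k = 8, 5, 2, 4, 2
layers = [make_layer(rng, dim) for _ in range(depth)]
tokens = rng.normal(size=(n_tokens, dim))
buffer0 = np.zeros((n_buf, dim))  # zero init here; REP uses task-aware init

out, buffer = forward_with_replay(tokens, layers, buffer0, replay_at=k)
print(out.shape, buffer.shape)  # (7, 8) (2, 8)
```

In this sketch the buffer accumulates information from the first two layers and joins the token sequence at the third, so the output carries two extra tokens. A fuller version would keep one private buffer per modality plus a shared buffer.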

Paper Structure

This paper contains 24 sections, 11 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Motivation of the proposed work. (a) Missing modalities may impair the model’s understanding of the task and its ability to correctly recognize categories. (b) Long-distance contextual information is crucial for multimodal emotion recognition; local modeling errors may introduce bias into the global prediction.
  • Figure 2: Workflow of the proposed REP. (1) Missing scenarios are defined, and different missing types are initialized into private or shared buffers. (2) These buffers cache information from the first $k-1$ layers via residual bypass during fine-tuning, and the cached information is replayed in the $k$-th layer. (3) During inference, the REP fine-tuned model significantly improves recognition accuracy under missing-modality conditions.
  • Figure 3: (a) The baseline is the CLIP model, using ViT-B/16 as the visual encoder. (b) Previous works such as MAP, MMP, and DCP use random or zero initialization, treating different missing scenarios as independent inputs and fine-tuning the model with prompts. (c) Our proposed REP adopts task-dependent dynamic initialization, caching feature buffers through residual bypass, storing shared and private features separately, and replaying them in deeper layers. (d) Comparison on MM-IMDb, where the missing rate for text-missing is 70%; for both-missing, each modality is missing in 35% of samples (so the total missing rate is also 70%).
  • Figure 4: Missing-modality tests on acoustic-seismic temporal sensor modalities.
  • Figure 5: Ablation study on REP buffer configuration. Results are under 70% missing rate. From left to right are the text-missing, image-missing, and both-missing settings, respectively.
  • ...and 5 more figures
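
The missing-rate arithmetic in the Figure 3(d) caption can be made concrete with a short sketch. The function `assign_missing` and its label names are hypothetical, and the sketch assumes that in the both-missing setting the text-missing and image-missing samples are disjoint, which is what makes 35% + 35% add up to a 70% total.

```python
import random

def assign_missing(n_samples, rate, scenario, seed=0):
    """Label each sample 'complete', 'text_missing', or 'image_missing'.

    scenario: 'text', 'image', or 'both'. For 'both', the rate is split
    evenly between the two single-modality cases (assumed disjoint).
    """
    n_missing = round(n_samples * rate)
    if scenario == "both":
        n_text = n_missing // 2
        n_image = n_missing - n_text
    elif scenario == "text":
        n_text, n_image = n_missing, 0
    elif scenario == "image":
        n_text, n_image = 0, n_missing
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    labels = (["text_missing"] * n_text
              + ["image_missing"] * n_image
              + ["complete"] * (n_samples - n_missing))
    random.Random(seed).shuffle(labels)  # spread the cases across the dataset
    return labels

labels = assign_missing(1000, 0.70, "both")
counts = {name: labels.count(name) for name in sorted(set(labels))}
print(counts)  # {'complete': 300, 'image_missing': 350, 'text_missing': 350}
```

With `scenario="text"` and the same rate, all 700 incomplete samples lose their text instead, matching the text-missing setting in the caption.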