Buffer replay enhances the robustness of multimodal learning under missing-modality
Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang
TL;DR
This work tackles the challenge of missing modalities in multimodal learning by introducing REP, a lightweight framework that caches early-layer representations in private and shared buffers, replays them across deeper layers, and uses dynamic initialization to enhance generalization. Through residual bypass updates and orthogonality constraints, REP preserves modality-specific signals while maintaining cross-modal semantics, enabling robust performance when information from some modalities is unavailable. Extensive experiments across vision-language, vision-language-audio, and temporal multimodal tasks demonstrate that REP consistently outperforms prior prompt-based and generative approaches, often with negligible parameter overhead. The results establish REP as a practical, robust approach for real-world multimodal systems facing incomplete modality availability.
Abstract
Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP), which (1) constructs modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employs a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) uses a task-aware dynamic initialization mechanism to configure these buffers according to the task, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods whether one or several modalities are missing, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
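The buffer-replay idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how cached features are aggregated, how they are merged into deeper layers, or the exact form of the orthogonality constraint. The names `ReplayBuffer`, `forward`, `orthogonality_penalty`, and the mixing weight `alpha` are hypothetical, and the sketch assumes mean aggregation, an additive residual merge, and a squared-cosine penalty between a private and a shared feature vector.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonality_penalty(private, shared):
    """Squared cosine similarity (assumed form of the constraint).

    Returns 0.0 when the private and shared vectors are orthogonal.
    """
    denom = math.sqrt(dot(private, private) * dot(shared, shared))
    if denom == 0.0:
        return 0.0
    return (dot(private, shared) / denom) ** 2


class ReplayBuffer:
    """Caches features from selected early layers for later replay."""

    def __init__(self, cache_layers):
        self.cache_layers = set(cache_layers)
        self.store = []

    def maybe_cache(self, layer_idx, features):
        # Only layers listed in cache_layers contribute to the buffer.
        if layer_idx in self.cache_layers:
            self.store.append(list(features))

    def replay(self):
        """Element-wise mean of cached features, or None if empty."""
        if not self.store:
            return None
        n = len(self.store)
        return [sum(col) / n for col in zip(*self.store)]


def forward(x, layers, buffer, replay_from, alpha=0.5):
    """Run the layers, adding replayed features from layer `replay_from` on."""
    h = list(x)
    for i, layer in enumerate(layers):
        if i >= replay_from:
            replayed = buffer.replay()
            if replayed is not None:
                # Residual bypass (assumed additive): inject cached
                # early-layer features into the deeper layer's input.
                h = [hv + alpha * rv for hv, rv in zip(h, replayed)]
        h = layer(h)
        buffer.maybe_cache(i, h)
    return h
```

As a usage example under the same assumptions, three toy layers that each double their input, with layer 0 cached and replay starting at layer 2, map the input `[1.0, 2.0]` to `[10.0, 20.0]`: the cached `[2.0, 4.0]` is scaled by `alpha = 0.5` and added to `[4.0, 8.0]` before the last layer. In a real model the layers would be transformer blocks and one private and one shared buffer would be kept per modality.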
