Balance-aware Sequence Sampling Makes Multi-modal Learning Better
Zhi-Hao Guan
TL;DR
This work tackles modality imbalance in multi-modal learning (MML) by looking beyond objective-level balancing to the order in which training samples are presented. It introduces Balance-aware Sequence Sampling (BSS), in which a multi-perspective measurer computes a balance score $s(x)$ for each sample from two criteria: prediction similarity and a loss-based criterion. The score guides data presentation through two schedulers: a heuristic pacing function inspired by curriculum learning (CL) and a learning-based dynamic sampler. Experiments on six datasets show consistent gains over vanilla fusion and state-of-the-art MML baselines, and ablations confirm the value of combining both criteria with adaptive sampling. The method is model-agnostic and can be plugged into existing MML pipelines to improve robustness and performance, particularly in imbalanced cross-modal scenarios.
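The summary above does not give the exact form of $s(x)$, so the following is only an illustrative sketch of how a two-criterion balance score could be computed for a two-modality classifier. Everything here is an assumption made for illustration: the function names, the use of cosine similarity between uni-modal predictions, the normalized loss-gap term, and the mixing weight `alpha` are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def balance_score(logits_a, logits_b, labels, alpha=0.5, eps=1e-12):
    """Hypothetical per-sample balance score in [0, 1]; higher = more balanced.

    logits_a, logits_b: (n, num_classes) uni-modal logits (assumed inputs).
    labels: (n,) integer class labels.
    alpha: assumed weight between the two criteria.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)

    # Criterion 1 (assumed form): cosine similarity of the two predictions.
    norms = np.linalg.norm(p_a, axis=1) * np.linalg.norm(p_b, axis=1)
    similarity = (p_a * p_b).sum(axis=1) / (norms + eps)

    # Criterion 2 (assumed form): one minus the normalized gap between
    # the per-modality cross-entropy losses.
    idx = np.arange(len(labels))
    loss_a = -np.log(np.clip(p_a[idx, labels], eps, 1.0))
    loss_b = -np.log(np.clip(p_b[idx, labels], eps, 1.0))
    loss_agreement = 1.0 - np.abs(loss_a - loss_b) / (loss_a + loss_b + eps)

    return alpha * similarity + (1.0 - alpha) * loss_agreement


# Two samples with label 0: the first has agreeing modalities,
# the second has modalities that disagree strongly.
logits_a = np.array([[2.0, 0.0], [5.0, 0.0]])
logits_b = np.array([[2.0, 0.0], [0.0, 5.0]])
scores = balance_score(logits_a, logits_b, np.array([0, 0]))
```

Under these assumptions, the first sample scores close to 1 and the second close to 0, which is the ordering a balanced-to-imbalanced curriculum would need.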
Abstract
To address the modality imbalance caused by data heterogeneity, existing multi-modal learning (MML) approaches primarily focus on rebalancing the modalities from the perspective of optimization objectives. However, almost all existing methods ignore the impact of sample sequences: an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample. Based on this evaluation, we employ a heuristic scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples, to rebalance MML. Moreover, considering that sample balance may evolve as the model's capability increases, we propose a learning-based probabilistic sampling method that dynamically updates the training sequences at the epoch level, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method over state-of-the-art (SOTA) MML approaches.
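The two scheduling ideas in the abstract can be illustrated with a minimal sketch. It assumes per-sample balance scores are already available (higher meaning more balanced). The linear pacing function, the `start_frac` parameter, and the softmax-weighted sampler are hypothetical choices made for illustration; in particular, the sampler is a simple stand-in and not the learning-based method the paper proposes.

```python
import numpy as np


def pacing_fraction(epoch, total_epochs, start_frac=0.3):
    """Assumed linear pacing: fraction of the sorted data exposed at `epoch`."""
    if total_epochs <= 1:
        return 1.0
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_frac + (1.0 - start_frac) * t


def curriculum_subset(scores, epoch, total_epochs, start_frac=0.3):
    """Indices of the training subset for `epoch`, most balanced samples first."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    frac = pacing_fraction(epoch, total_epochs, start_frac)
    k = max(1, int(np.ceil(frac * len(order))))
    return order[:k]


def probabilistic_sequence(scores, rng, temperature=1.0):
    """Draw an epoch-level ordering without replacement.

    Samples with higher scores tend to appear earlier. The scores would be
    recomputed each epoch as the model changes (an assumption of this sketch).
    """
    s = np.asarray(scores, dtype=float) / temperature
    p = np.exp(s - s.max())
    p /= p.sum()
    return rng.choice(len(p), size=len(p), replace=False, p=p)


example_scores = [0.1, 0.9, 0.5, 0.7]
first_epoch = curriculum_subset(example_scores, epoch=0, total_epochs=3, start_frac=0.5)
last_epoch = curriculum_subset(example_scores, epoch=2, total_epochs=3, start_frac=0.5)
sampled = probabilistic_sequence(example_scores, np.random.default_rng(0))
```

With these example scores, the first epoch exposes only the two most balanced samples and the last epoch exposes all four. The sampled sequence is a permutation of all indices whose order changes with the random seed and with the scores.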
