
Balance-aware Sequence Sampling Makes Multi-modal Learning Better

Zhi-Hao Guan

TL;DR

This work tackles modality imbalance in multi-modal learning by shifting focus from objective-only balance to training sequence balance. It introduces Balance-aware Sequence Sampling (BSS), which uses a multi-perspective measurer to compute a balance score $s(x)$ from prediction similarity and a loss-based criterion, guiding data presentation through two schedulers: a heuristic CL-inspired pacing function and a learning-based dynamic sampler. The approach is validated across six datasets, showing consistent gains over vanilla fusion and state-of-the-art MML baselines, with ablations confirming the value of combining both criteria and adaptive sampling. The method is model-agnostic and can be plugged into existing MML pipelines to improve robustness and performance, particularly in imbalanced cross-modal scenarios.

Abstract

To address the modality imbalance caused by data heterogeneity, existing multi-modal learning (MML) approaches primarily focus on balancing this difference from the perspective of optimization objectives. However, almost all existing methods ignore the impact of sample sequences: an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample. Based on this evaluation, we employ a heuristic scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to rebalance MML. Moreover, considering that sample balance may evolve as the model capability increases, we propose a learning-based probabilistic sampling method to dynamically update the training sequences at the epoch level, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) MML approaches.
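
The two scheduling ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the per-sample balance scores are taken as given (with a higher score assumed to mean a more balanced sample), the linear pacing function and its `start_frac` parameter are hypothetical choices, and the epoch-level sampler is a fixed weighted shuffle (Efraimidis-Spirakis sampling without replacement) standing in for the learning-based sampler, which the abstract does not specify.

```python
import math
import random


def linear_pacing(epoch, total_epochs, start_frac=0.3):
    """Fraction of the score-sorted training set exposed at `epoch`.

    Hypothetical linear pacing: starts at `start_frac` and reaches 1.0
    in the final epoch.
    """
    if total_epochs <= 1:
        return 1.0
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start_frac + (1.0 - start_frac) * progress


def curriculum_subset(scores, epoch, total_epochs, start_frac=0.3):
    """Indices of the samples available at `epoch`, most balanced first.

    Assumes a higher score means a more balanced sample.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    frac = linear_pacing(epoch, total_epochs, start_frac)
    keep = max(1, round(frac * len(scores)))
    return order[:keep]


def probabilistic_order(scores, rng, temperature=1.0):
    """Draw one epoch's ordering without replacement.

    Samples with higher scores tend to appear earlier. Each sample gets
    the key u ** (1 / w) with u uniform in [0, 1) and w = exp(score / T);
    sorting by descending key gives a weighted random permutation.
    """
    weights = [math.exp(s / temperature) for s in scores]
    keys = [rng.random() ** (1.0 / w) for w in weights]
    return sorted(range(len(scores)), key=lambda i: keys[i], reverse=True)


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.7]  # made-up balance scores for four samples
    for epoch in range(3):
        print(epoch, curriculum_subset(scores, epoch, 3, start_frac=0.5))
    print(probabilistic_order(scores, random.Random(0)))
```

In this sketch the heuristic scheduler only ever grows the training subset, whereas the probabilistic ordering can be redrawn every epoch from refreshed scores, which mirrors the motivation given above for updating sequences as the model's capability changes.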

Paper Structure (15 sections, 13 equations, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: A motivating example of sequence sampling. (a) Traditional vanilla training, in which the multi-modal performance fails to outperform the best uni-modal counterpart. (b) Curriculum learning (CL) via training sequence sampling. (c) Comparison of different training paradigms. The results show that CL outperforms the baseline (vanilla training), while anti-CL is inferior to it.
  • Figure 2: Illustration of the BSS method. (a) Multi-modal training framework for learning multi-modal representations. (b1) and (b2) Heuristic and learning-based schedulers for sequence sampling.
  • Figure 3: (a) and (b) Comparison of hyper-parameter settings on the CREMA-D dataset. (c) and (d) Robust performance achieved by using the CLIP pre-trained model as encoders.
  • Figure 4: Qualitative results of sample evaluation. We show some representative samples selected from different segments (early, middle, and late) of the evaluation sequence.