MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance

Yake Wei, Di Hu

TL;DR

MMPareto tackles gradient conflicts in multitask-like multimodal learning by jointly considering gradient direction and magnitude to provide innocent unimodal assistance. It identifies that conventional Pareto integration can worsen generalization in multimodal settings due to mismatched gradient magnitudes and covariances, and proposes a two-pronged MMPareto update that yields a common descent direction with amplified SGD noise. The approach is supported by theoretical arguments and empirical evidence across audio-visual datasets and both CNN and transformer backbones, showing improvements in multimodal and unimodal performance and extending to multi-task scenarios. The work advances robust, scalable training for dense cross-modal models and offers a foundation for gradient-aware strategies in complex multimodal objectives.

Abstract

Multimodal learning methods with targeted unimodal learning objectives have proven effective at alleviating the imbalanced multimodal learning problem. However, in this paper, we identify a previously ignored gradient conflict between the multimodal and unimodal learning objectives, which can mislead the optimization of the unimodal encoder. To diminish these conflicts, we observe a discrepancy between the multimodal loss and the unimodal loss: both the gradient magnitude and the covariance of the easier-to-learn multimodal loss are smaller than those of the unimodal one. With this property, we analyze Pareto integration in our multimodal scenario and propose the MMPareto algorithm, which can ensure a final gradient whose direction is common to all learning objectives and whose magnitude is enhanced to improve generalization, providing innocent unimodal assistance. Finally, experiments across multiple types of modalities and frameworks with dense cross-modal interaction indicate the superior and extendable performance of our method. Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty, demonstrating its scalability. The source code and dataset are available at https://github.com/GeWu-Lab/MMPareto_ICML2024.
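
The gradient integration described in the abstract can be sketched for a single encoder with one multimodal gradient and one unimodal gradient. This is a minimal illustration, not the authors' implementation: the function names `min_norm_weight` and `integrate_gradients` are hypothetical, the weights come from the standard closed-form two-objective min-norm (Pareto) solution, and rescaling the conflicting case to the norm of the plain gradient sum is an assumption made here to illustrate "enhanced magnitude".

```python
import numpy as np


def min_norm_weight(g_m: np.ndarray, g_u: np.ndarray) -> float:
    """Return alpha in [0, 1] minimising ||alpha * g_m + (1 - alpha) * g_u||^2."""
    diff = g_m - g_u
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: every alpha gives the same vector.
        return 0.5
    alpha = float((g_u - g_m) @ g_u) / denom
    return float(np.clip(alpha, 0.0, 1.0))


def integrate_gradients(g_m: np.ndarray, g_u: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Combine a multimodal gradient g_m and a unimodal gradient g_u (sketch)."""
    cos = float(g_m @ g_u) / (np.linalg.norm(g_m) * np.linalg.norm(g_u) + eps)
    if cos >= 0.0:
        # No conflict: the plain sum already descends on both objectives.
        return g_m + g_u
    # Conflict: take the min-norm direction, which has a non-negative
    # inner product with both gradients.
    alpha = min_norm_weight(g_m, g_u)
    h = alpha * g_m + (1.0 - alpha) * g_u
    # Assumption: restore the magnitude to that of the plain sum.
    target = np.linalg.norm(g_m + g_u)
    return h / (np.linalg.norm(h) + eps) * target


# Toy conflicting gradients (made-up numbers, for illustration only).
g_multi = np.array([1.0, 0.0])
g_uni = np.array([-1.0, 2.0])
combined = integrate_gradients(g_multi, g_uni)
print(combined, combined @ g_multi, combined @ g_uni)
```

In this toy case the two gradients have negative cosine similarity, yet the combined vector has a positive inner product with each of them, which is what a direction "common to all learning objectives" means in practice.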

Paper Structure (26 sections, 13 equations, 6 figures, 9 tables, 1 algorithm)

Figures (6)

  • Figure 1: (a) Cosine similarity between multimodal and unimodal gradients in the video encoder on the Kinetics Sounds dataset [arandjelovic2017look]. (b) Performance of methods on the multi-task dataset Cityscapes [cordts2016cityscapes]; results are from [sener2018multi]. (c) Multimodal and unimodal prediction performance of methods in the video encoder on Kinetics Sounds. "Single loss" is the result of a model trained individually with the one corresponding learning objective. (d) Gradient magnitude distribution for a fixed video encoder on Kinetics Sounds; each count is one mini-batch of SGD optimization. The uniform baseline sums all losses with equal weight, without special integration.
  • Figure 2: Illustration of multimodal framework and gradient integration strategy of our MMPareto.
  • Figure 3: Value of $\frac{3k-1}{2k+2}$ varies with $k$.
  • Figure 4: (a) Cosine similarity between the gradients of the multimodal and unimodal losses in the video encoder on CREMA-D. (b) Gradient magnitude distribution in the video encoder on CREMA-D.
  • Figure 5: Visualization of the loss landscape and corresponding accuracy of the uniform baseline, conventional Pareto, and our MMPareto method. MMPareto reaches flatter minima. The visualization method is from [li2018visualizing]. The uniform baseline sums all losses with equal weight, without special integration.
  • ...and 1 more figure
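
The expression in the Figure 3 caption, $\frac{3k-1}{2k+2}$, can be evaluated directly to see how it varies with $k$. The listing above does not define $k$, so the sketch below simply treats it as a positive real number (an assumption), and the helper name `ratio` is hypothetical.

```python
def ratio(k: float) -> float:
    """Evaluate (3k - 1) / (2k + 2) for k != -1."""
    return (3 * k - 1) / (2 * k + 2)


# Sample values of k chosen only for illustration.
for k in (1, 2, 3, 5, 10, 100):
    print(f"k = {k:>3}: {ratio(k):.3f}")
```

The derivative is $\frac{8}{(2k+2)^2} > 0$, so the value increases monotonically for $k > -1$; it equals 1 at $k = 3$ and approaches $\frac{3}{2}$ as $k$ grows.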

Theorems & Definitions (4)

  • Remark 1
  • Remark 2
  • Remark 2
  • proof