Quantifying and Enhancing Multi-modal Robustness with Modality Preference

Zequn Yang, Yake Wei, Ce Liang, Di Hu

TL;DR

Motivated by a theoretical analysis of multi-modal robustness, the paper introduces Certifiable Robust Multi-modal Training (CRMT), a training procedure that alleviates the influence of modality preference and explicitly regulates the essential components of robustness, improving it significantly and in a certifiable manner.

Abstract

Multi-modal models have shown a promising capability to integrate information from various sources, yet they are also found to be vulnerable to pervasive perturbations such as uni-modal attacks and missing modalities. To counter these perturbations, robust multi-modal representations are highly desirable, that is, representations positioned well away from the discriminative multi-modal decision boundary. In this paper, unlike conventional empirical studies, we focus on a commonly used joint multi-modal framework and show theoretically that larger uni-modal representation margins and more reliable integration of modalities are essential components for achieving higher robustness. This finding further explains the limitation of multi-modal robustness and the phenomenon that multi-modal models are often vulnerable to attacks on one specific modality. Moreover, our analysis reveals how a widespread issue, namely that the model has different preferences for different modalities, limits multi-modal robustness by influencing these essential components and can make attacks on a specific modality highly effective. Inspired by our theoretical finding, we introduce a training procedure called Certifiable Robust Multi-modal Training (CRMT), which alleviates this influence of modality preference and explicitly regulates the essential components to significantly improve robustness in a certifiable manner. Our method demonstrates substantial improvements in performance and robustness compared with existing methods. Furthermore, our training procedure can be easily extended to enhance other robust training strategies, highlighting its credibility and flexibility.
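
The link the abstract draws between uni-modal margins, integration, and robustness can be illustrated with a generic Lipschitz argument. The sketch below is not the paper's theorem or its CRMT procedure. It assumes, purely for illustration, that the score gap between the true label and the closest competing label is a non-negative weighted sum of uni-modal margins, and that each margin is Lipschitz in its own modality's input. The function names and the example numbers are hypothetical.

```python
import math


def score_gap(margins, weights):
    """Weighted sum of uni-modal margins (assumed additive fusion)."""
    return sum(c * z for c, z in zip(weights, margins))


def joint_radius(margins, lipschitz, weights):
    """l2 radius below which no joint perturbation can close the gap.

    Assumes weights >= 0 and that each margin changes by at most
    lipschitz[m] * ||delta_m||. By Cauchy-Schwarz the gap drops by at most
    sqrt(sum((c_m * tau_m)^2)) * ||delta||_2.
    """
    gap = score_gap(margins, weights)
    if gap <= 0:
        return 0.0  # already misclassified under this toy model
    denom = math.sqrt(sum((c * t) ** 2 for c, t in zip(weights, lipschitz)))
    return gap / denom


def unimodal_radius(margins, lipschitz, weights, m):
    """Same bound when only modality m is perturbed."""
    gap = score_gap(margins, weights)
    if gap <= 0:
        return 0.0
    return gap / (weights[m] * lipschitz[m])


# Hypothetical numbers: modality 1 is "preferred" (larger integration weight).
margins = [0.5, 2.0]
lipschitz = [1.0, 1.0]
weights = [1.0, 3.0]

r_joint = joint_radius(margins, lipschitz, weights)
r_mod0 = unimodal_radius(margins, lipschitz, weights, 0)
r_mod1 = unimodal_radius(margins, lipschitz, weights, 1)
```

Under these assumptions the preferred modality has the smaller uni-modal radius, which matches the abstract's observation that attacks on a specific modality can be especially effective. Raising the margins or lowering the product of weight and Lipschitz constant for that modality enlarges the radius.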

Paper Structure

This paper contains 52 sections, 5 theorems, 45 equations, 8 figures, 11 tables.

Key Result

Theorem 3.3

Given an input $\bm{x}$ with ground-truth label $y \in [K]$ and the closest label $j \neq y$, $\zeta_{j}^{(m)}(\bm{x}^{(m)})$ as the representation margin for $m$-th modality with Lipschitz constraint $\tau_j^{(m)}$, and the integration factor $c^{(m)}_j$. Define $\bm{x}'$ as the perturbed sample, a

Figures (8)

  • Figure 1: Accuracy of different multi-modal robust training methods compared with the Joint Training (JT) baseline under $\ell_2$-PGD attacks over a range of radii on modality #$v$ (vision) and modality #$a$ (audio), respectively, on the Kinetics Sounds dataset. All of these methods are more vulnerable to attacks on one specific modality, #$a$.
  • Figure 2: Illustration of the traditional multi-modal joint learning framework (Baltrušaitis et al., 2018) (left) and our framework, which introduces orthogonality into each uni-modal classifier (right). Our framework can easily be combined with explicit regularization to achieve larger certified robustness.
  • Figure 3: Evaluation of the ratio of vulnerability indicators between modality #$v$ and modality #$a$ (preferred). We show the ratio for MMAT, CRMT-AT, and CRMT-AT with only the first training procedure.
  • Figure 4: Robustness accuracy against uni-modal attacks of different sizes. The dotted line shows the difference in robustness accuracy between the two modalities.
  • Figure 5: Ablation studies of our methods on the UCF101 dataset, revealing the effect of each part we introduced.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Definition 3.1
  • Definition 3.2
  • Theorem 3.3
  • Proposition 3.4
  • Theorem 3.5
  • Theorem 7.1
  • proof
  • Theorem 7.2
  • proof