Learning Representation and Synergy Invariances: A Provable Framework for Generalized Multimodal Face Anti-Spoofing

Xun Lin, Shuai Wang, Yi Yu, Zitong Yu, Jiale Zhou, Yizhong Liu, Xiaochun Cao, Alex Kot, Yefeng Zheng

TL;DR

RiSe is a provable framework for generalized multimodal face anti-spoofing that achieves state-of-the-art cross-domain performance, as verified by theoretical analysis and experiments. Its Asymmetric Invariant Risk Minimization (AsyIRM) component learns an invariant spherical decision boundary in radial space to fit asymmetric class distributions, while preserving domain cues in angular space.

Abstract

Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.
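As a rough illustration of the radial/angular split that AsyIRM relies on, the NumPy sketch below decomposes each feature vector into its norm and its unit direction, then thresholds the norm against a radius. It is not the authors' implementation: the function names, the fixed `radius`, and the convention that live samples lie inside the sphere (compact reals, diverse spoofs outside) are assumptions made for this example.

```python
import numpy as np


def split_radial_angular(features, eps=1e-12):
    """Split each row of `features` into (norm, unit direction).

    The norm is the "radial" part and the unit vector is the "angular" part.
    """
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    directions = features / np.maximum(norms, eps)  # avoid division by zero
    return norms[:, 0], directions


def radial_logits(features, radius):
    """Signed distance to a sphere of the given radius.

    Positive values lie outside the sphere, negative values inside.
    """
    norms, _ = split_radial_angular(features)
    return norms - radius


def predict_spoof(features, radius):
    # Assumption for this sketch: live samples fall inside the sphere,
    # spoof samples fall outside it.
    return radial_logits(features, radius) > 0
```

Because the decision depends only on the norm, rotating a feature vector (changing its direction) leaves the live/spoof prediction unchanged, which is what leaves the angular component free to carry domain information.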

Paper Structure

This paper contains 45 sections, 10 theorems, 88 equations, 5 figures, 8 tables.

Key Result

Lemma 1

With probability at least $1\!-\!\delta$, for a hypothesis $h$ sampled from the posterior distribution $\mathcal{P}$, its true risk $\mathcal{R}_{\mathcal{T}}(h)$ can be bounded by the empirical risk $\hat{\mathcal{R}}_{\mathcal{S}}$ and a KL-divergence term as follows: where $N$ is the number of training samples, $\Pi$ is the prior distribution, and $\mathrm{KL}(\mathcal{P} \,\|\, \Pi)$ is the K
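The displayed inequality is not present in the lemma statement as reproduced here. For orientation only, one classical McAllester-type form of a PAC-Bayes bound, written with the same symbols $N$, $\delta$, $\mathcal{P}$, and $\Pi$, is given below. This is an assumption about the general shape of such bounds, not the paper's exact statement: it compares true and empirical risk under a single distribution, so the paper's version relating $\mathcal{R}_{\mathcal{T}}$ to $\hat{\mathcal{R}}_{\mathcal{S}}$ may differ in constants or contain additional terms.

```latex
\mathbb{E}_{h \sim \mathcal{P}}\big[\mathcal{R}(h)\big]
\;\le\;
\mathbb{E}_{h \sim \mathcal{P}}\big[\hat{\mathcal{R}}_{\mathcal{S}}(h)\big]
+ \sqrt{\frac{\mathrm{KL}(\mathcal{P} \,\|\, \Pi) + \ln\frac{2\sqrt{N}}{\delta}}{2N}}
```

In bounds of this shape, a smaller $\mathrm{KL}(\mathcal{P} \,\|\, \Pi)$ or a larger $N$ tightens the gap between empirical and true risk.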

Figures (5)

  • Figure 1: Illustration of our decomposition of the multimodal generalization risk into two trainable invariant risks. (a) The modal representation invariant risk (Risk 1) arises when unimodal representations learned on a source domain ($\mathcal{S}$) fail to generalize to a target domain ($\mathcal{T}$) due to a large domain shift. (b) The modal synergy invariant risk (Risk 2) occurs when a spurious cross-modal correlation (synergy) learned in $\mathcal{S}$ proves invalid in $\mathcal{T}$, leading to shortcut-based prediction errors.
  • Figure 2: Framework of our proposed RiSe: (a) Overall end-to-end architecture, where modality features are optimized by our two core modules, AsyIRM and MMSD; (b) AsyIRM learns a disentangled embedding by using feature norms and domain-invariant radius for the live/spoof classification, and feature directions for a domain-separating angular push; (c) MMSD disrupts spurious cross-modal correlations through pretext tasks based on cross-sample frequency mixing and spatial token shuffling. For better clarity, we only show the scenario of RGB-Depth FAS here.
  • Figure 3: Visualization of features learned by AsyIRM across different LOO protocols. AsyIRM disentangles domain (encoded in angle) and liveness (encoded in norm) information in the embedding space.
  • Figure 4: Hyperparameter analysis of loss weights in Eq. \ref{eq:loss-weight}. We evaluate the impact of the weights for (a) $\mathcal{L}_{\mathrm{ang}}$, (b) $\mathcal{L}_{\mathrm{IRM}}$, (c) $\mathcal{L}_{\mathrm{MMSD}}$, and (d) $\mathcal{L}_{\mathrm{aux}}$.
  • Figure 5: Hyperparameter analysis on the initialization value $r\!=\!\varphi(s)$ for the radial classifier. Performance is measured in (a) HTER (lower is better) and (b) AUC (higher is better).
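Figure 5 reports HTER, the Half Total Error Rate, which is the mean of the false acceptance rate and the false rejection rate at a chosen threshold. The sketch below shows one way to compute it with NumPy. The label convention (1 = live, 0 = spoof), the score convention (higher = more likely live), and the function name are assumptions for this example. How the paper selects the threshold is not stated in this summary.

```python
import numpy as np


def hter(scores, labels, threshold):
    """Half Total Error Rate at a fixed threshold.

    Assumed conventions: labels are 1 for live and 0 for spoof, and a
    sample is accepted as live when its score is >= threshold.
    Both classes must be present in `labels`.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    accepted = scores >= threshold
    far = np.mean(accepted[labels == 0])   # spoofs wrongly accepted
    frr = np.mean(~accepted[labels == 1])  # live samples wrongly rejected
    return 0.5 * (far + frr)
```

For example, `hter([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0], threshold=0.7)` accepts only the first sample, giving FAR = 0, FRR = 0.5, and HTER = 0.25.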

Theorems & Definitions (22)

  • Definition 1: Empirical Risk Minimization, ERM
  • Definition 2: Invariant Risk Minimization, IRM
  • Definition 3: Unimodal Generalization Risk
  • Definition 4: Multimodal Generalization Risk
  • Lemma 1: PAC-Bayes Generalization Error Bound \cite{pac-bayes}
  • Proposition 1: KL Divergence Term of Vanilla IRM
  • Proposition 2: KL Divergence Term in Multimodal IRM
  • Proposition 3: KL Divergence Term of AsyIRM
  • Theorem 1: AsyIRM Achieves a Tighter Generalization Error Upper Bound
  • Proposition 4: MMSD Reduces Modal Synergy Risk
  • ...and 12 more
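For readers unfamiliar with IRM (Definition 2), the sketch below illustrates the widely used IRMv1 relaxation for binary classification: each training environment contributes its empirical risk plus the squared gradient of that risk with respect to a dummy scale $w$ applied to the logits, evaluated at $w = 1$. This is a generic illustration and an assumption about the form of the penalty. The paper's own definitions, and the AsyIRM objective in particular, may be stated differently. The function names and the weight `lam` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy, written as softplus(z) - y * z for stability."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return np.mean(np.logaddexp(0.0, logits) - labels * logits)


def irmv1_penalty(logits, labels):
    """Squared gradient of mean BCE(w * logits, labels) w.r.t. w at w = 1.

    For BCE the per-sample derivative is (sigmoid(z) - y) * z.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    grad = np.mean((sigmoid(logits) - labels) * logits)
    return grad ** 2


def irm_objective(env_logits, env_labels, lam=1.0):
    """Sum over environments of empirical risk plus lam times the penalty."""
    total = 0.0
    for logits, labels in zip(env_logits, env_labels):
        total += bce_with_logits(logits, labels) + lam * irmv1_penalty(logits, labels)
    return total
```

The penalty is zero when the shared classifier is simultaneously stationary in every environment, which is the sense in which IRM seeks a predictor that is invariant across domains.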