Stage-Adaptive Reliability Modeling for Continuous Valence-Arousal Estimation

Yubeen Lee, Sangeun Lee, Junyeop Cha, Eunil Park

Abstract

Continuous valence-arousal estimation in real-world environments is challenging due to inconsistent modality reliability and interaction-dependent variability in audio-visual signals. Existing approaches primarily focus on modeling temporal dynamics and often overlook that modality reliability can vary substantially across interaction stages. To address this issue, we propose SAGE, a Stage-Adaptive reliability modeling framework that explicitly estimates and calibrates modality-wise confidence during multimodal integration. SAGE introduces a reliability-aware fusion mechanism that dynamically rebalances audio and visual representations according to their stage-dependent informativeness, preventing unreliable signals from dominating the prediction process. By separating reliability estimation from feature representation, the proposed framework enables more stable emotion estimation under cross-modal noise, occlusion, and varying interaction conditions. Extensive experiments on the Aff-Wild2 benchmark demonstrate that SAGE consistently improves concordance correlation coefficient (CCC) scores over existing multimodal fusion approaches, highlighting the effectiveness of reliability-driven modeling for continuous affect prediction.
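
For readers unfamiliar with the evaluation metric named in the abstract, the concordance correlation coefficient between predictions and labels is commonly defined as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The following is a minimal sketch of that standard definition in plain Python, not the authors' evaluation code. The function name `concordance_cc` and the handling of the zero-denominator case are assumptions made for illustration.

```python
from statistics import fmean


def concordance_cc(pred, target):
    """Concordance correlation coefficient between two equal-length sequences.

    Uses population (biased) variance and covariance, as in the standard
    definition. The name and the zero-denominator convention below are
    illustrative assumptions, not taken from the paper.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")

    mean_p, mean_t = fmean(pred), fmean(target)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_t = fmean([(t - mean_t) ** 2 for t in target])
    cov = fmean([(p - mean_p) * (t - mean_t) for p, t in zip(pred, target)])

    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical; treated here as
        # perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * cov / denom


# Example: predictions that track the labels but are offset by a constant.
valence_true = [1.0, 2.0, 3.0]
valence_pred = [2.0, 3.0, 4.0]
score = concordance_cc(valence_pred, valence_true)
```

Unlike Pearson correlation, which would be 1 for the offset example above, CCC penalizes the constant bias through the squared mean difference in the denominator.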

Paper Structure (18 sections, 16 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Temporal reliability varies within modalities due to expressive facial cues and varying speech activity. SAGE adaptively reweights modality contributions over time, leading to stable and accurate VA prediction.
  • Figure 2: Overall architecture of the proposed SAGE framework for continuous VA estimation. Visual and audio features are extracted using pretrained encoders and temporally encoded via TCNs. The fused representation is processed by the SAGE module, which performs reliability-guided fusion and temporal refinement, followed by a regression head for frame-level VA prediction.
  • Figure 3: Detailed architecture of the proposed SAGE module. RGF computes time-dependent reliability scores to adaptively reweight temporal features. The reliability-adjusted representation is then refined by a self-attention-based temporal transformer to capture long-range dependencies before final regression.
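
The captions describe time-dependent reliability scores that reweight features, but this page does not give the exact formulation. As a rough illustration of the general idea only, the sketch below normalizes per-frame reliability logits for the visual and audio streams with a softmax and takes the weighted sum of the two feature vectors. The function names, the softmax normalization, and the assumption that the reliability logits are supplied externally are all hypothetical; this is not the paper's RGF module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reliability_weighted_fusion(visual, audio, visual_logits, audio_logits):
    """Fuse two feature sequences frame by frame.

    visual, audio: lists of T feature vectors with the same dimension.
    visual_logits, audio_logits: lists of T raw reliability scores
    (hypothetical inputs; a real model would predict them).
    Returns a list of T fused feature vectors.
    """
    fused = []
    for v, a, s_v, s_a in zip(visual, audio, visual_logits, audio_logits):
        w_v, w_a = softmax([s_v, s_a])  # weights sum to 1 for each frame
        fused.append([w_v * x + w_a * y for x, y in zip(v, a)])
    return fused


# Frame 0: both modalities equally reliable. Frame 1: audio judged unreliable.
visual_feats = [[1.0, 0.0], [1.0, 0.0]]
audio_feats = [[0.0, 1.0], [0.0, 1.0]]
fused_feats = reliability_weighted_fusion(
    visual_feats, audio_feats, visual_logits=[0.0, 10.0], audio_logits=[0.0, -10.0]
)
```

In the second frame the fused vector stays close to the visual features, which is the behavior the captions describe qualitatively: an unreliable signal contributes little to the representation passed on to temporal refinement.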