FIESTA: Fisher Information-based Efficient Selective Test-time Adaptation

Mohammadmahdi Honarmand, Onur Cezmi Mutlu, Parnian Azizian, Saimourya Surabhi, Dennis P. Wall

TL;DR

This work tackles domain shifts in facial expression recognition by introducing a Fisher-information–driven selective test-time adaptation framework that updates only a small, high-importance subset of parameters during video inference. By combining per-parameter Fisher scores with a temporal smoothing objective, the method achieves robust adaptation while drastically reducing computational overhead. On the AffWild2 benchmark, it yields notable F1 gains (up to F1 = 0.350) with as few as ~22,000 updated weights, outperforming both the base model and existing TTA approaches. Ablation studies show that updating a minimal, frame-specific set of weights and using 1–3 frames for Fisher scoring are sufficient for substantial performance improvements, underscoring the practicality of this approach for real-world affective computing.

Abstract

Robust facial expression recognition in unconstrained, "in-the-wild" environments remains challenging due to significant domain shifts between training and testing distributions. Test-time adaptation (TTA) offers a promising solution by adapting pre-trained models during inference without requiring labeled test data. However, existing TTA approaches typically rely on manually selecting which parameters to update, potentially leading to suboptimal adaptation and high computational costs. This paper introduces a novel Fisher-driven selective adaptation framework that dynamically identifies and updates only the most critical model parameters based on their importance as quantified by Fisher information. By integrating this principled parameter selection approach with temporal consistency constraints, our method enables efficient and effective adaptation specifically tailored for video-based facial expression recognition. Experiments on the challenging AffWild2 benchmark demonstrate that our approach significantly outperforms existing TTA methods, achieving a 7.7% relative improvement in F1 score over the base model while adapting only 22,000 parameters, more than 20 times fewer than comparable methods. Our ablation studies further reveal that parameter importance can be effectively estimated from minimal data, with sampling just 1–3 frames sufficient for substantial performance gains. The proposed approach not only enhances recognition accuracy but also dramatically reduces computational overhead, making test-time adaptation more practical for real-world affective computing applications.
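
To make the selection step concrete, here is a minimal sketch of Fisher-based parameter scoring. It is not the authors' implementation: the scoring formula is not given in this excerpt, so the sketch assumes the common diagonal empirical Fisher approximation (mean squared gradient of the log-likelihood, evaluated at the model's own pseudo-labels) on a toy linear softmax classifier in NumPy. The function names `fisher_scores` and `top_fraction_mask`, the toy dimensions, and the 5% selection fraction are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fisher_scores(W, frames):
    """Diagonal empirical Fisher for a toy model with logits = frames @ W.T.

    W:      (n_classes, n_features) weight matrix
    frames: (n_frames, n_features) features of the sampled frames
    """
    probs = softmax(frames @ W.T)               # (n_frames, n_classes)
    pseudo = probs.argmax(axis=1)               # pseudo-labels from the model itself
    onehot = np.eye(W.shape[0])[pseudo]
    # Gradient of log p(pseudo | x) w.r.t. W is (onehot - probs) outer x.
    grads = (onehot - probs)[:, :, None] * frames[:, None, :]
    return (grads ** 2).mean(axis=0)            # same shape as W

def top_fraction_mask(scores, fraction):
    """Boolean mask marking exactly the top `fraction` of entries by score."""
    k = max(1, int(round(fraction * scores.size)))
    idx = np.argpartition(scores.ravel(), -k)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(scores.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))        # toy classifier weights
frames = rng.normal(size=(3, 16))   # 3 sampled frames, in line with the 1–3 frame ablation
scores = fisher_scores(W, frames)
mask = top_fraction_mask(scores, fraction=0.05)
print(int(mask.sum()), "of", mask.size, "weights selected")
```

Only the weights where `mask` is true would then be updated during adaptation; the rest stay frozen, which is where the reduction in adapted parameters comes from.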

Paper Structure

This paper contains 18 sections, 6 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of our proposed selective test-time domain adaptation framework using Fisher information. The method consists of three key components: 1) Fisher Scoring Weight Selection (top left), which processes sample frames from a test video through the original model to generate pseudo-labels and compute Fisher importance scores for model parameters; 2) Temporal Smoothing Domain Adaptation (bottom), which selectively updates only the most important weights (highlighted in yellow) by minimizing the difference between original model logits and their temporally smoothed versions using a low-pass filter; and 3) Inference (top right), where the adapted model with updated weights produces final expression predictions.
  • Figure 2: Ablation study on the effect of percentage threshold and frame sampling on adaptation performance. Left: Results when selecting from early layer weights (0.5% of total model weights). Right: Results when selecting from all model weights. The horizontal dashed line indicates base model performance (F1=0.325), with the green region above showing improvement and the red region below showing degradation.
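
The temporal smoothing adaptation described in the Figure 1 caption can be sketched as follows. This is a rough illustration under several assumptions, none of which are confirmed by this excerpt: the low-pass filter is taken to be a simple moving average, the objective is taken to be the mean squared error between the adapted logits and the smoothed logits of the original model, optimization is plain gradient descent, and the model is a toy linear layer. The names `moving_average` and `adapt`, and the random mask standing in for Fisher-selected weights, are hypothetical.

```python
import numpy as np

def moving_average(logits, window=5):
    """Low-pass filter each logit channel along time (edge-padded moving average)."""
    assert window % 2 == 1, "use an odd window so the output stays aligned"
    half = window // 2
    padded = np.pad(logits, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid") for c in range(logits.shape[1])],
        axis=1,
    )

def adapt(W0, frames, mask, lr=0.1, steps=20, window=5):
    """Update only the masked weights so logits move toward the smoothed originals."""
    target = moving_average(frames @ W0.T, window)   # fixed target from the original model
    W = W0.copy()
    losses = []
    for _ in range(steps):
        residual = frames @ W.T - target             # (n_frames, n_classes)
        losses.append(float((residual ** 2).mean()))
        grad = 2.0 * residual.T @ frames / residual.size
        W -= lr * mask * grad                        # frozen weights receive a zero update
    return W, losses

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 16))          # toy original weights
frames = rng.normal(size=(30, 16))     # 30 consecutive frames of one video
mask = np.zeros(W0.shape, dtype=bool)  # stand-in for a Fisher-selected subset
mask.flat[rng.choice(W0.size, size=10, replace=False)] = True

W_adapted, losses = adapt(W0, frames, mask)
print(f"loss before: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

After adaptation, inference would simply run the frames through the model with `W_adapted` in place of `W0`, matching the third component in the caption.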