FIESTA: Fisher Information-based Efficient Selective Test-time Adaptation
Mohammadmahdi Honarmand, Onur Cezmi Mutlu, Parnian Azizian, Saimourya Surabhi, Dennis P. Wall
TL;DR
This work tackles domain shifts in facial expression recognition by introducing a Fisher-information–driven selective test-time adaptation framework that updates only a small, high-importance subset of parameters during video inference. By combining per-parameter Fisher scores with a temporal smoothing objective, the method achieves robust adaptation while drastically reducing computational overhead. On the AffWild2 benchmark, it yields notable F1 gains (up to F1 = 0.350) with as few as roughly 22,000 updated weights, outperforming both the base model and existing TTA approaches. Ablation studies show that updating a minimal, frame-specific set of weights and using 1–3 frames for Fisher scoring are sufficient for substantial performance improvements, underscoring the practicality of this approach for real-world affective computing.
Abstract
Robust facial expression recognition in unconstrained, "in-the-wild" environments remains challenging due to significant domain shifts between training and testing distributions. Test-time adaptation (TTA) offers a promising solution by adapting pre-trained models during inference without requiring labeled test data. However, existing TTA approaches typically rely on manually selecting which parameters to update, potentially leading to suboptimal adaptation and high computational costs. This paper introduces a novel Fisher-driven selective adaptation framework that dynamically identifies and updates only the most critical model parameters based on their importance as quantified by Fisher information. By integrating this principled parameter selection approach with temporal consistency constraints, our method enables efficient and effective adaptation specifically tailored for video-based facial expression recognition. Experiments on the challenging AffWild2 benchmark demonstrate that our approach significantly outperforms existing TTA methods, achieving a 7.7% improvement in F1 score over the base model while adapting only 22,000 parameters, more than 20 times fewer than comparable methods. Our ablation studies further reveal that parameter importance can be effectively estimated from minimal data, with sampling just 1-3 frames sufficient for substantial performance gains. The proposed approach not only enhances recognition accuracy but also dramatically reduces computational overhead, making test-time adaptation more practical for real-world affective computing applications.
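
The selection step described above can be illustrated with a minimal sketch. It assumes, without confirmation from the abstract, that per-parameter importance is approximated by the diagonal empirical Fisher information (the mean squared gradient of some unsupervised test-time loss over a few sampled frames) and that the top-k scoring parameters are then updated with a masked gradient step. The function names `fisher_scores`, `top_k_mask`, and `masked_sgd_step`, the learning rate, and the toy gradients are all hypothetical and chosen only for illustration. The paper's actual loss, selection rule, and optimizer may differ.

```python
import numpy as np

def fisher_scores(per_frame_grads):
    """Diagonal empirical Fisher estimate (assumed form):
    mean of squared per-frame gradients for each parameter."""
    g = np.stack(per_frame_grads)      # shape: (n_frames, n_params)
    return (g ** 2).mean(axis=0)       # shape: (n_params,)

def top_k_mask(scores, k):
    """Boolean mask that is True for the k highest-scoring parameters."""
    mask = np.zeros(scores.shape, dtype=bool)
    idx = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores
    mask[idx] = True
    return mask

def masked_sgd_step(params, grad, mask, lr=0.1):
    """Plain gradient step applied only to the selected parameters."""
    return params - lr * grad * mask

# Toy example: 4 parameters, gradients from 2 sampled frames.
grads = [np.array([0.1, -2.0, 0.0, 0.5]),
         np.array([-0.1, 1.0, 0.0, 1.5])]
scores = fisher_scores(grads)          # one importance score per parameter
mask = top_k_mask(scores, k=2)         # keep the 2 most important parameters
params = np.zeros(4)
new_params = masked_sgd_step(params, grads[-1], mask)
```

In this toy case only the second and fourth parameters change, while the others stay frozen. A real model would flatten or iterate over its parameter tensors, and k would be on the order of the roughly 22,000 weights reported above.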
