Restricted Receptive Fields for Face Verification
Kagan Ozturk, Aman Bhatta, Haiyu Wu, Patrick Flynn, Kevin W. Bowyer
TL;DR
This work addresses interpretability in face verification by replacing a single holistic representation with a patch-based similarity that sums local contributions from restricted receptive fields. It introduces two approaches: region-based patch representations with learned region weights, and RRFNet, which mean-pools patch features and computes the cosine similarity between the resulting pooled representations. Across seven benchmark datasets, the 56×56 patch configuration often matches or exceeds full-image methods, and 28×28 patches remain competitive in certain settings; both provide inherent patch-level explanations. The results indicate that constraining receptive fields can preserve or improve accuracy while making the decision process interpretable without post-hoc explanations. RRFNet-56 delivers the strongest performance overall.
Abstract
Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improving interpretability is to assign importance to individual pixels using post-hoc methods. Although these methods are widely used to explain black-box models, their fidelity to the model's actual reasoning is uncertain because reliable evaluation metrics are lacking. This limitation motivates an alternative approach: designing models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28×28 within 112×112 face images, and surpasses state-of-the-art methods when using 56×56 patches.
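
To make the idea of a locally additive similarity concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each face is already represented by one feature vector per patch (random vectors stand in for network outputs here), and it assumes 16 patches per image, as a non-overlapping 28×28 grid over a 112×112 image would give. Under those assumptions, the cosine similarity between mean-pooled patch features splits exactly into one term per patch pair. The function names `patch_contributions` and `cosine` are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np


def patch_contributions(feats_a, feats_b):
    """Split the cosine similarity of mean-pooled features into per-patch-pair terms.

    feats_a: (N, D) array, one feature vector per patch of image A.
    feats_b: (M, D) array, one feature vector per patch of image B.
    Returns an (N, M) array whose entries sum to the global cosine similarity.
    """
    mean_a = feats_a.mean(axis=0)
    mean_b = feats_b.mean(axis=0)
    # The dot product of the two means equals the average of all pairwise
    # patch dot products, so dividing by N * M and by both norms gives
    # terms that add up to the cosine similarity of the means.
    scale = (
        np.linalg.norm(mean_a)
        * np.linalg.norm(mean_b)
        * feats_a.shape[0]
        * feats_b.shape[0]
    )
    return feats_a @ feats_b.T / scale


def cosine(u, v):
    """Plain cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Assumption: random stand-in features, 16 patches per image, 8 dimensions each.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(16, 8))
feats_b = rng.normal(size=(16, 8))

contrib = patch_contributions(feats_a, feats_b)
global_sim = cosine(feats_a.mean(axis=0), feats_b.mean(axis=0))

# The per-pair terms reproduce the global score, which is the additivity property.
assert np.isclose(contrib.sum(), global_sim)
```

Summing `contrib` along one axis gives a per-patch score for either image, which is one way such a decomposition could be displayed as a heat map over the patch grid.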
