Restricted Receptive Fields for Face Verification

Kagan Ozturk, Aman Bhatta, Haiyu Wu, Patrick Flynn, Kevin W. Bowyer

TL;DR

This work tackles the interpretability challenge in face verification by shifting from a single holistic representation to a patch-based similarity that sums local contributions from restricted receptive fields. It introduces two approaches: region-based patch representations with learned region weights, and RRFNet, which aggregates patch features via mean pooling and computes cosine similarity between the resulting patch-mean representations. Across seven benchmark datasets, the 56×56 patch configuration often matches or exceeds full-image methods, and 28×28 patches remain competitive in certain settings; both provide inherent, patch-level explanations. The results indicate that constraining receptive fields can preserve or improve accuracy while yielding an inherently interpretable decision process without post-hoc explanations, with RRFNet-56 delivering the strongest performance overall.

Abstract

Understanding how deep neural networks make decisions is crucial for analyzing their behavior and diagnosing failure cases. In computer vision, a common approach to improve interpretability is to assign importance to individual pixels using post-hoc methods. Although they are widely used to explain black-box models, their fidelity to the model's actual reasoning is uncertain due to the lack of reliable evaluation metrics. This limitation motivates an alternative approach, which is to design models whose decision processes are inherently interpretable. To this end, we propose a face similarity metric that breaks down global similarity into contributions from restricted receptive fields. Our method defines the similarity between two face images as the sum of patch-level similarity scores, providing a locally additive explanation without relying on post-hoc analysis. We show that the proposed approach achieves competitive verification performance even with patches as small as 28x28 within 112x112 face images, and surpasses state-of-the-art methods when using 56x56 patches.
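
To make the idea of a locally additive similarity concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image is already represented by a set of patch embeddings (the arrays `feats_a` and `feats_b` and the function name `patch_contributions` are hypothetical), mean-pools them into one vector per image, and uses the linearity of the dot product to split the cosine similarity of the pooled vectors into one term per patch of image A. The paper's exact patch-level score may be defined differently.

```python
import numpy as np


def patch_contributions(feats_a, feats_b):
    """Split the cosine similarity of mean-pooled patch features into per-patch terms.

    feats_a, feats_b: arrays of shape (num_patches, dim) holding hypothetical
    patch embeddings for image A and image B.
    Returns (contributions, total) where contributions sums to total.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    n = feats_a.shape[0]

    # Mean pooling over patches gives one global vector per image.
    g_a = feats_a.mean(axis=0)
    g_b = feats_b.mean(axis=0)
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)

    # Because g_a is the average of the rows of feats_a, the dot product
    # g_a . g_b is the average of the per-patch dot products f_i . g_b.
    # Dividing each of those by n * denom yields additive contributions.
    contributions = feats_a @ g_b / (n * denom)
    total = float(g_a @ g_b / denom)
    return contributions, total


# Usage with random stand-in embeddings (12 patches, 8 dimensions).
rng = np.random.default_rng(0)
contribs, score = patch_contributions(rng.normal(size=(12, 8)),
                                      rng.normal(size=(12, 8)))
print(np.isclose(contribs.sum(), score))  # the per-patch terms add up to the score
```

Each entry of `contribs` can be read as how much one patch of image A pushes the overall score up or down, which is the kind of explanation that needs no post-hoc attribution step.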

Paper Structure

This paper contains 9 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of the traditional (top) and the proposed approaches (bottom). In the traditional approach, face similarity is measured using a single global representation. Because feature extraction relies on black-box models, the resulting similarity score offers no insight into the decision process. In contrast, our approach extracts representations from restricted receptive fields and computes the overall similarity score as the sum of local similarities, enhancing human understanding through patch-level decomposition.
  • Figure 2: Visualization of patch-level similarities for two image pairs computed using RRFNet-28. Each row presents a pair of corresponding patches (Patch A and Patch B) from Image A and Image B, along with their similarity scores. The overall face similarity score is obtained by aggregating the scores from all patch pairs. The heatmaps at the bottom illustrate the spatial distribution of patch similarities on Image A.
  • Figure 3: Restricted receptive fields of sizes (a) $28 \times 28$ and (b) $56 \times 56$ are shown for a given (c) $112 \times 112$ face image. The top-left coordinates of each image patch are indicated below the corresponding patch. For RRFNet-28, the four patches at the corners are excluded, while for RRFNet-56, one patch at each corner is excluded.
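
The patch layout described in the last caption can be sketched as a simple enumeration of top-left coordinates. This is an illustration under stated assumptions: the function name `patch_top_lefts` and the `stride` parameter are hypothetical, and the stride values used below are guesses, since the captions give the patch and image sizes but not the spacing between patches.

```python
def patch_top_lefts(image_size=112, patch_size=28, stride=28, drop_corners=True):
    """List (x, y) top-left coordinates of square patches on a square image.

    stride is an assumed parameter; the source does not state the spacing.
    With drop_corners=True, the patches at the four grid corners are skipped.
    """
    positions = list(range(0, image_size - patch_size + 1, stride))
    first, last = positions[0], positions[-1]
    coords = []
    for y in positions:
        for x in positions:
            is_corner = y in (first, last) and x in (first, last)
            if drop_corners and is_corner:
                continue
            coords.append((x, y))
    return coords


# 28x28 patches on a non-overlapping grid (assumed stride of 28).
print(len(patch_top_lefts(112, 28, 28)))  # 12 under this assumption
# 56x56 patches with an assumed stride of 28, so neighbouring patches overlap.
print(len(patch_top_lefts(112, 56, 28)))  # 5 under this assumption
```

Note that a non-overlapping grid of 56×56 patches on a 112×112 image would contain only corner patches, so some overlap has to be assumed for the larger patch size; the true spacing should be taken from the paper.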