MSConv: Multiplicative and Subtractive Convolution for Face Recognition

Si Zhou, Yain-Whar Si, Xiaochen Yuan, Xiaofan Li, Xiaoxiang Liu, Xinyuan Zhang, Cong Lin, Xueyuan Gong

TL;DR

The paper addresses a limitation of conventional feature fusion in face recognition: it overemphasizes salient features and neglects differential features. It introduces MSConv, a plug-in module that combines multi-scale mixed convolution with a Multiplication Operation (MO) and a Subtraction Operation (SO) to learn salient and differential cues jointly. Empirical results with MS1MV3 as the training set show state-of-the-art or competitive performance across eight benchmarks, and ablation and visualization analyses support the complementary roles of MO and SO. The module is intended as a practical, architecture-agnostic enhancement for face recognition and may carry over to other vision tasks that depend on fine-grained feature fusion.

Abstract

Neural networks offer various methods of feature fusion. The choice of strategy can significantly affect the quality of the feature representation and, consequently, the model's ability to extract representative and discriminative features. In face recognition, traditional feature fusion methods include feature concatenation and feature addition. Recently, various attention-based fusion strategies have emerged. However, we found that these methods focus primarily on the important features in the image, referred to as salient features in this paper, while neglecting another equally important set of features for image recognition tasks, which we term differential features. This may cause the model to overlook critical local differences when dealing with complex facial samples. Therefore, in this paper, we propose an efficient convolution module called MSConv (Multiplicative and Subtractive Convolution), designed to balance the model's learning of salient and differential features. Specifically, we employ multi-scale mixed convolution to capture both local and broader contextual information from face images, and then use the Multiplication Operation (MO) and the Subtraction Operation (SO) to extract salient and differential features, respectively. Experimental results demonstrate that, by integrating both salient and differential features, MSConv outperforms models that focus only on salient features.
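The pipeline described in the abstract (two scales of convolution, then multiplication and subtraction) can be illustrated with a toy 1-D sketch. This is not the authors' implementation: the function names, the 1-D setting, the fixed kernels, and the choice to return the two fused signals separately are all assumptions made for illustration, since the abstract does not say how MSConv combines the MO and SO outputs.

```python
# Toy 1-D sketch of "multi-scale branches + MO + SO".
# Everything here (names, kernels, 1-D signals) is a hypothetical illustration,
# not the MSConv module from the paper.

def conv1d_same(signal, kernel):
    """Zero-padded 'same' 1-D correlation with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def multi_scale_mo_so(signal, small_kernel, large_kernel):
    """Run two branches with different receptive fields, then fuse them."""
    local = conv1d_same(signal, small_kernel)    # narrow receptive field
    context = conv1d_same(signal, large_kernel)  # wider receptive field
    # MO: element-wise product is large only where both branches respond.
    salient = [a * b for a, b in zip(local, context)]
    # SO: element-wise difference highlights where the branches disagree.
    differential = [a - b for a, b in zip(local, context)]
    return salient, differential

signal = [0.0, 0.0, 1.0, 0.0, 0.0]
identity_kernel = [0.0, 1.0, 0.0]  # stand-in for a small-scale filter
average_kernel = [0.2] * 5         # stand-in for a large-scale filter
salient, differential = multi_scale_mo_so(signal, identity_kernel, average_kernel)
```

In this toy case the product keeps only the position where both branches are active, while the difference is positive at the peak and negative around it, which is one way to read the "salient" versus "differential" distinction.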

Paper Structure

This paper contains 20 sections, 9 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Four feature fusion methods. The symbol "+" denotes element-wise addition, "$\circledast$" denotes element-wise multiplication, "--" denotes element-wise subtraction, and "ⓒ" denotes channel-wise concatenation of feature maps. Here, $u$, $g$, $v$, and $t$ represent the fused feature maps. Note that only the feature concatenation operation increases the channel dimension.
  • Figure 2: Selective Kernel Convolution.
  • Figure 3: Two different attention mechanisms and the insertion points in foundational models (ResNet). C denotes the Channel Attention Module, and S denotes the Spatial Attention Module.
  • Figure 4: The architecture of MSConv.
  • Figure 5: Vector Addition Operations and Subtraction Operations. The dashed lines in the figure are auxiliary lines.
  • ...and 3 more figures
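The four fusion operations named in the Figure 1 caption can be sketched on tiny feature maps as follows. The representation (a feature map as a list of channels, each a flat list of activations) and the function names are assumptions chosen to keep the example dependency-free; the sketch only demonstrates the caption's point that concatenation alone changes the channel count.

```python
# Minimal sketch of the four fusion operations from the Figure 1 caption.
# A feature map is modeled as a list of channels; each channel is a flat
# list of activations. Names and representation are illustrative assumptions.
import operator

def elementwise(op, x, y):
    """Apply a binary op position by position; shapes must match."""
    assert len(x) == len(y), "channel counts must match"
    return [[op(a, b) for a, b in zip(cx, cy)] for cx, cy in zip(x, y)]

def fuse_add(x, y):
    return elementwise(operator.add, x, y)

def fuse_mul(x, y):
    return elementwise(operator.mul, x, y)

def fuse_sub(x, y):
    return elementwise(operator.sub, x, y)

def fuse_concat(x, y):
    # Channel-wise concatenation: the only fusion that grows the channel dimension.
    return x + y

x = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 positions each
y = [[0.5, 2.0], [1.0, 4.0]]

added = fuse_add(x, y)
multiplied = fuse_mul(x, y)
subtracted = fuse_sub(x, y)
concatenated = fuse_concat(x, y)
```

Here `added`, `multiplied`, and `subtracted` each keep 2 channels, while `concatenated` has 4. Positions where `x` and `y` agree give zero under subtraction, so only the differing positions survive.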