
Multi-threshold Deep Metric Learning for Facial Expression Recognition

Wenwu Yang, Jinyi Yu, Tuo Chen, Zhenguang Liu, Xun Wang, Jianbing Shen

TL;DR

This work tackles robust facial expression recognition by addressing the sensitivity of the triplet loss to its margin threshold. It introduces multi-threshold deep metric learning (Mul-DML), which partitions the embedding into $N$ slices and jointly trains each slice with a distinct threshold sampled from $[\tau_{\min}, \tau_{\max}]$, thereby harvesting diverse expression-feature representations. A Dual Triplet Loss is proposed to combat incomplete judgements and accelerate convergence, enabling end-to-end training within the standard triplet-learning framework. Empirical results on CK+, MMI, SFEW, and RAF-DB show consistent improvements over baselines and state-of-the-art methods, with high accuracy on CK+ ($98.49\%$) and strong performance in real-world settings, demonstrating the practical impact of multi-threshold embeddings for FER.
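To make the threshold sensitivity concrete, here is a minimal sketch of the standard hinge-form triplet loss with margin $\tau$, using squared Euclidean distance. The embeddings are illustrative toy values, not from the paper; the point is that whether a triplet contributes any loss (and hence any gradient) depends entirely on the chosen $\tau$.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, tau):
    """Hinge-form triplet loss: max(0, d(a, p) - d(a, n) + tau)."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + tau)

# Toy 2-D embeddings: the positive is close to the anchor, the negative far.
a, p, n = [0.0, 0.0], [0.3, 0.0], [1.0, 0.0]

# Small margins are already satisfied (loss 0); a large margin keeps the
# triplet "active" and still produces a gradient (~0.29 here for tau = 1.2).
for tau in (0.2, 0.5, 1.2):
    print(tau, triplet_loss(a, p, n, tau))
```

The validation problem the paper targets is visible here: with $\tau = 0.2$ or $0.5$ this triplet is ignored, while with $\tau = 1.2$ it is still pushed apart, so different thresholds carve out different inter-class margins.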

Abstract

Effective expression feature representations generated by triplet-based deep metric learning are highly advantageous for facial expression recognition (FER). The performance of triplet-based deep metric learning, however, hinges on identifying the best threshold for the triplet loss, and threshold validation is challenging because the ideal threshold varies across datasets and even across classes within the same dataset. In this paper, we present a multi-threshold deep metric learning technique that not only avoids this difficult threshold validation but also vastly increases the capacity of triplet loss learning to construct expression feature representations. We find that each threshold of the triplet loss intrinsically determines a distinctive distribution of inter-class variations and thus corresponds to a unique expression feature representation. Therefore, rather than selecting a single optimal threshold from a valid threshold range, we thoroughly sample thresholds across the range, allowing the representation characteristics manifested by thresholds within the range to be fully extracted and leveraged for FER. To realize this approach, we partition the embedding layer of the deep metric learning network into a collection of slices and formulate the training of these embedding slices as an end-to-end multi-threshold deep metric learning problem. Each embedding slice corresponds to a sampled threshold and is learned by enforcing the corresponding triplet loss, yielding a set of distinct expression features, one per embedding slice. This makes the embedding layer, composed of the set of slices, a more informative and discriminative feature, hence enhancing FER accuracy. Extensive evaluations demonstrate the superior performance of the proposed approach on both posed and spontaneous facial expression datasets.
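The slicing scheme described above can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: the slice count, the uniform threshold sampling over $[\tau_{\min}, \tau_{\max}]$, the equal non-overlapping partition, and the plain summed triplet loss (rather than the paper's Dual Triplet Loss) are all assumptions made for clarity.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(a, p, n, tau):
    """Hinge-form triplet loss with margin threshold tau."""
    return max(0.0, sq_dist(a, p) - sq_dist(a, n) + tau)

def slice_embedding(emb, n_slices):
    """Partition an embedding into n_slices equal, non-overlapping slices.
    Assumes len(emb) is divisible by n_slices."""
    size = len(emb) // n_slices
    return [emb[i * size:(i + 1) * size] for i in range(n_slices)]

def sample_thresholds(tau_min, tau_max, n_slices):
    """Sample one threshold per slice, spread evenly across [tau_min, tau_max]."""
    if n_slices == 1:
        return [(tau_min + tau_max) / 2.0]
    step = (tau_max - tau_min) / (n_slices - 1)
    return [tau_min + i * step for i in range(n_slices)]

def multi_threshold_loss(anchor, positive, negative, tau_min, tau_max, n_slices):
    """Sum the per-slice triplet losses, each slice with its own threshold."""
    taus = sample_thresholds(tau_min, tau_max, n_slices)
    total = 0.0
    for a_s, p_s, n_s, tau in zip(slice_embedding(anchor, n_slices),
                                  slice_embedding(positive, n_slices),
                                  slice_embedding(negative, n_slices),
                                  taus):
        total += triplet_loss(a_s, p_s, n_s, tau)
    return total
```

Because each slice is trained against a different threshold, slices whose larger margins are not yet met remain active on triplets that the small-margin slices already satisfy; concatenating the slices then yields the single informative embedding used for classification.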


Paper Structure

This paper contains 30 sections, 11 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: For face images (a), the feature distances may be dominated by the identity of the subjects rather than their expression (top of the right column), a problem that can be resolved with deep metric learning (b). In (c), the case of incomplete judgements is depicted, where the anchor is further from the positive than the negative (the dashed arrows), but the constraint imposed by the triplet loss has already been satisfied (the solid arrows), so incomplete judgements are not penalized.
  • Figure 2: The proposed deep metric learning network for FER. It consists of a batch input layer, a deep CNN (ResNet-18 in our implementation), an embedding layer, and a classification layer. Through a linear layer comprised of an embedding matrix $\mathbf{M}$, an embedding feature for the classification of expressions is learned in the embedding layer.
  • Figure 3: Distributions of inter-class variations with respect to the feature embeddings learned by triplet loss learning, where dual triplet loss is used as the loss function and the original images are samples from the facial expression database SFEW [Dhall15].
  • Figure 4: The module of the multi-threshold deep metric learning in which the embedding layer is divided into multiple non-overlapping slices and each slice is formulated as a separate triplet loss learning to produce a unique feature embedding $f_i(x)$ with a specific sample threshold $\tau_i$.
  • Figure 5: Cycle of incomplete judgements during training time. In the cycle, the incomplete judgement of the hard triplet $(A, B, C)$ at the current iteration leads to the incomplete judgement of the hard triplet $(B, A, C)$ at the following iteration, and vice versa.
  • ...and 6 more figures