Learning an Ensemble Token from Task-driven Priors in Facial Analysis

Sunyong Seo, Semin Kim, Jongha Lee

TL;DR

The paper tackles the challenge of leveraging multiple pre-trained priors for facial analysis without incurring heavy computation. It introduces KT-Adapter, which learns a knowledge token via self-attention over prior embeddings and fuses it with a canonical task while keeping encoders frozen, achieving efficiency and performance gains. Extensive experiments across landmark detection, age estimation, and recognition demonstrate robust improvements with low overhead, supported by ablations on mask strategies and the number of priors. The work highlights both the practical impact for real-time facial analysis and avenues for extending task-prior fusion beyond facial analysis, while noting overfitting as a potential limitation.

Abstract

Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. Although backbone architectures have advanced over the past decade, combining high-fidelity models often incurs heavy computational costs in feature representation. In this work, we introduce KT-Adapter, a novel methodology for learning a knowledge token that enables the integration of high-fidelity feature representations in a computationally efficient manner. Specifically, we propose a robust prior unification learning method that generates a knowledge token within a self-attention mechanism, sharing mutual information across the pre-trained encoders. This knowledge token approach offers high efficiency with negligible computational cost. Our results show improved performance across facial analysis tasks, with statistically significant enhancements observed in the feature representations.

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The key concept of our proposed approach. Our method leverages pre-trained priors from a frozen encoder and unifies them into a learnable token, called the knowledge token.
  • Figure 2: Knowledge learning typically trains (red) pre-trained (blue) priors within a fusion module, while ControlNet incorporates an auxiliary lightweight (dashed-line) encoder. In contrast, KT-Adapter implicitly (grey) learns encoder priors within its fusion module, thereby dramatically reducing computational cost.
  • Figure 3: Training phase of the proposed method. Within the blue block, all encoders $E$ are kept frozen, so their parameters are not updated during training. The KT-Adapter and the canonical branch decoder $d_c$ are the trainable layers of the architecture. In the green block, the decoders in the set $D \setminus \{d_c\}$, which correspond to tasks other than the canonical task, are excluded from the forward pass graph. Furthermore, the ground-truth set $Y \setminus \{y_c\}$, drawn with a dashed line, may or may not be available depending on the dataset; only $y_c$, the ground truth of the canonical branch, is always required.
  • Figure 4: Detailed structure of KT-Adapter. The KT-Adapter is trained on restricted information $T_{\text{mask}}$ to enrich the $e_c$ in Fig. \ref{fig:main-structure}. During training, the MHSA layer processes $T_{\text{mask}}$, which has been subsampled via the drop mask. During inference, the MHSA layer processes only $\{t_k, t_c\}$, and the canonical mask is omitted.
  • Figure 5: Inference phase of the decoupled encoder. The parameters of all layers are the same as those used during training. The only difference lies in the input provided to the KT-Adapter module, which is exclusively the canonical token.
  • ...and 2 more figures
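
The training and inference behaviour described in the captions of Figures 4 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention is single-head with identity projections instead of a learned MHSA layer, the drop mask is modelled as independent random dropping of prior tokens only, and the canonical mask is not modelled at all. The names `kt_adapter_forward`, `drop_mask`, and `drop_prob` are hypothetical; `t_k` and `t_c` follow the caption notation for the knowledge token and the canonical token.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Assumption: a real MHSA layer has learned query/key/value projections
    and several heads; both are omitted to keep the sketch short.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out


def drop_mask(prior_tokens, drop_prob, rng):
    """Hypothetical drop mask: keep each prior token with probability 1 - drop_prob."""
    return [t for t in prior_tokens if rng.random() >= drop_prob]


def kt_adapter_forward(t_k, t_c, prior_tokens, training, drop_prob=0.5, rng=None):
    """Return the updated knowledge token (first row of the attention output)."""
    if training:
        rng = rng or random.Random(0)
        # Training: attend over a subsampled token set, playing the role of T_mask.
        t_mask = [t_k, t_c] + drop_mask(prior_tokens, drop_prob, rng)
    else:
        # Inference: only {t_k, t_c} are processed, so the prior tokens are unused.
        t_mask = [t_k, t_c]
    return self_attention(t_mask)[0]


# Toy 4-dimensional tokens (made-up values, for illustration only).
t_k = [0.1, 0.2, 0.3, 0.4]
t_c = [1.0, 0.0, 0.0, 0.0]
priors = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

train_out = kt_adapter_forward(t_k, t_c, priors, training=True)
infer_out = kt_adapter_forward(t_k, t_c, priors, training=False)
```

In this sketch the inference path never reads `priors`, which mirrors the claim that the additional encoders can be decoupled at inference time. Randomly dropping prior tokens during training would, under these assumptions, push the knowledge token to stay useful when few or no priors are present.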