CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation

Yuntao Shou, Wei Ai, Tao Meng, Nan Yin, Keqin Li

TL;DR

CILF-CIAE is a novel CLIP-driven Image-Language Fusion method for Correcting Inverse Age Estimation. It consistently outperforms advanced methods such as LRA-GNN and MCGRL, and it introduces reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions.

Abstract

The age estimation task aims to predict the age of an individual by analyzing facial features in an image. Advances in age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control). In recent years, contrastive language-image pre-training (CLIP) has been widely used in multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods have high memory usage (quadratic complexity) when globally modeling images, and they lack an error feedback mechanism that informs the model about the quality of its age predictions. To tackle these issues, we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first use the CLIP model to extract image features and text semantic information separately, and map them into a semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has log-linear complexity. To further narrow the semantic gap between image and text features, we use an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Extensive experiments on multiple datasets show that CILF-CIAE achieves better age prediction results.
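
The complexity claim can be illustrated with a small sketch. Attention compares every token with every other token, which costs O(N^2) for N tokens, whereas mixing tokens in the frequency domain costs O(N log N) through the fast Fourier transform. The code below is only an illustration under stated assumptions: the abstract does not specify FourierFormer's internals, so the per-frequency filter `freq_filter` and the function `fourier_mix` are hypothetical names, and a single channel of scalar tokens stands in for real image features.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(x):
    """Inverse FFT: unnormalized inverse transform divided by the length."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def fourier_mix(tokens, freq_filter):
    """Mix one channel of N tokens globally: DFT, per-frequency scaling, IDFT.

    freq_filter is a hypothetical stand-in for learned weights. The real
    part is returned; it equals the full result whenever the filter keeps
    the spectrum conjugate-symmetric.
    """
    spectrum = fft([complex(t) for t in tokens])
    filtered = [s * w for s, w in zip(spectrum, freq_filter)]
    return [v.real for v in ifft(filtered)]


tokens = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
# An all-ones filter leaves the tokens unchanged (DFT followed by IDFT).
identity = fourier_mix(tokens, [1.0] * 8)
# Keeping only the zero-frequency term replaces every token by the mean.
dc_only = fourier_mix(tokens, [1.0] + [0.0] * 7)
```

Each output token depends on every input token, so the mixing is global, yet no pairwise token comparison is computed.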

Paper Structure (24 sections, 20 equations, 10 figures, 1 table)

This paper contains 24 sections, 20 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: We compare the differences between existing image processing paradigms and the paradigm proposed in this paper. As shown in Fig. 1(a), most image processing methods perform supervised learning by inputting images and then using manually annotated labels as supervision signals. As shown in Fig. 1(b), since manual annotation requires a large amount of resources, existing methods begin to build self-supervised learning models by contrasting input images. As shown in Fig. 1(c), we perform text-image contrastive learning by using the CLIP pre-trained model and transfer the learned knowledge to the age estimation prediction task. As shown in Figs. 1(d) and (e), existing methods are mainly based on CNN architecture and Transformer architecture based on attention mechanism to extract feature information of images. As shown in Fig. 1(f), we replace the attention module in the Transformer architecture with a Fourier prior module.
  • Figure 2: The overall framework for age prediction using CILF-CIAE. Specifically, we first use CLIP to extract image features and C-type text features, and then calculate the pixel-text similarity score. The similarity scores of the pixel-text pairs are fed into the age estimation module, and the age label is used as a supervision signal. To better utilize the prior knowledge of images, we introduce FourierFormer to extract contextual information in images to prompt the language model. Finally, we perform error optimization on the predicted age.
  • Figure 3: The overall framework of the proposed FourierFormer. FourierFormer includes a spatial interaction module, a channel evolution module, a discrete Fourier transform (DFT) module, and an inverse discrete Fourier transform (IDFT) module, which can effectively extract information from the global context of an image.
  • Figure 4: Details of the Fourier Prior Embedding module (FPE). FPE follows the global context information modeling idea of spatial interaction and channel evolution.
  • Figure 5: The flowchart of the correcting inverse age estimation. An existing age estimation model gives a first age estimate, which is assessed by the evaluation $E_P$. If the assessment fails, the optimization branch is activated. The age estimation error estimated by the ensemble error model is used for training to update the predicted age $x^\ast$. The process repeats until $e(x^\ast) \leq \epsilon$.
  • ...and 5 more figures
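
The correction loop described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under assumptions, not the authors' procedure: the caption does not give the update rule, so a finite-difference gradient step on the ensemble error is swapped in, and `error_models`, `correct_age`, the learning rate, and the toy quadratic error models are all hypothetical.

```python
def ensemble_error(x, error_models):
    """Average the error estimates e(x) of several surrogate error models."""
    return sum(model(x) for model in error_models) / len(error_models)


def correct_age(x0, error_models, eps=0.5, lr=0.2, max_steps=200, h=1e-3):
    """Refine a first age estimate x0 until the estimated error is <= eps.

    Assumption: the update is a gradient step on the ensemble error, with
    the gradient approximated by a central finite difference.
    """
    x = x0
    for step in range(max_steps):
        err = ensemble_error(x, error_models)
        if err <= eps:  # the estimate passes the check, so stop
            return x, err, step
        grad = (ensemble_error(x + h, error_models)
                - ensemble_error(x - h, error_models)) / (2 * h)
        x -= lr * grad  # optimization branch: move toward lower error
    return x, ensemble_error(x, error_models), max_steps


# Toy ensemble: three quadratic error models that disagree slightly on
# the age with the lowest error (29, 30, and 31 years).
error_models = [lambda x, c=c: 0.5 * (x - c) ** 2 for c in (29.0, 30.0, 31.0)]

corrected, final_error, steps = correct_age(40.0, error_models)
```

Starting from a first estimate of 40, the loop moves the prediction toward the region where the ensemble agrees the error is small and stops as soon as the estimated error falls to the threshold `eps`.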