15M Multimodal Facial Image-Text Dataset

Dawei Dai, YuTang Li, YingGe Liu, Mingming Jia, Zhang YuanHui, Guoyin Wang

TL;DR

FaceCaption-15M addresses the scarcity of large-scale, high-quality facial image-text data by constructing a corpus of 15M aligned facial image-caption pairs. The pipeline takes LAION-Face as raw input, crops faces with RetinaFace, automatically annotates 40 attributes, and generates captions from grammar templates that an LLM then rewrites into natural language. The authors introduce FLIP, a CLIP-like multimodal model trained on FaceCaption-15M, which couples a ViT-B/16 image encoder with a BERT-base text encoder and is optimized with ITC and ITM losses to align facial images with captions. The authors' analyses show that FaceCaption-15M offers higher image quality, richer and more diverse captions, and stronger text-image relevance than several existing datasets, enabling state-of-the-art performance on face-centered tasks. The work demonstrates the practical value of high-quality, large-scale facial image-text data for retrieval, recognition, and attribute-prediction tasks, and provides open access to data, code, and models, while acknowledging ethical considerations and the potential for misuse in applications such as deepfakes.

Abstract

Currently, image-text-driven multi-modal deep learning models have demonstrated outstanding potential in many fields. In practice, tasks centered around facial images have broad application prospects. This paper presents FaceCaption-15M, a large-scale, diverse, and high-quality dataset of facial images accompanied by their natural language descriptions (facial image-to-text). This dataset aims to facilitate research on face-centered tasks. FaceCaption-15M comprises over 15 million pairs of facial images and their corresponding natural language descriptions of facial features, making it the largest facial image-caption dataset to date. We conducted a comprehensive analysis of image quality, text naturalness, text complexity, and text-image relevance to demonstrate the superiority of FaceCaption-15M. To validate the effectiveness of FaceCaption-15M, we first trained a facial language-image pre-training model (FLIP, similar to CLIP) to align facial images with their corresponding captions in feature space. Subsequently, using both image and text encoders and fine-tuning only the linear layer, our FLIP-based models achieved state-of-the-art results on two challenging face-centered tasks. The purpose is to promote research in the field of face-related tasks through the availability of the proposed FaceCaption-15M dataset. All data, code, and models are publicly available: https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M
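
To make the alignment objective concrete, here is a minimal sketch of an image-text contrastive (ITC) loss. It assumes the common symmetric InfoNCE formulation used by CLIP-style models; the exact formulation used for FLIP may differ. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same face.
    The temperature is a hypothetical default, not a value from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # sim[i][j]: scaled cosine similarity between image i and caption j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    sim_transposed = [list(col) for col in zip(*sim)]
    image_to_text = diagonal_cross_entropy(sim)
    text_to_image = diagonal_cross_entropy(sim_transposed)
    return 0.5 * (image_to_text + text_to_image)
```

The loss is small when each image embedding is closest to its own caption embedding and large when the pairs are mismatched, which is the behaviour that alignment "in feature space" relies on.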

Paper Structure (18 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of our proposed FaceCaption-15M containing over 15 million facial image-text (right and left) pairs.
  • Figure 2: Pipeline of our FaceCaption-15M construction process. Pipeline includes face detection and cropping, face annotation, and automatic caption generation.
  • Figure 3: Image quality score distribution. (a) BRISQUE (Mittal, Moorthy, and Bovik, 2012) evaluation, with lower scores indicating better image quality; (b) CLIPIQA (Wang et al., 2022) evaluation, with higher scores indicating better image quality.
  • Figure 4: Text distribution. (a) Distribution of the five categories of annotations in FaceCaption-15M. (b) The percentage of sentences in the dataset with different word counts. (c) The number of unique 4-grams at different percentages of the data. (d) Examples of image-text pairs from LAION-Face and FaceCaption-15M. FaceCaption* denotes captions generated by the grammatical template without LLM rewriting.
  • Figure 5: Image-text matching score. We adopt the ITM score to measure image-text correlation of different datasets.
  • ...and 3 more figures
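
Figure 4(c) uses the number of unique 4-grams as a measure of caption diversity. The sketch below shows one way such a count could be computed. It assumes lowercased whitespace tokenization, which may not match the paper's preprocessing, and the function name is hypothetical.

```python
def unique_ngrams(captions, n=4):
    """Return the set of distinct word n-grams across a list of captions.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    seen = set()
    for caption in captions:
        tokens = caption.lower().split()
        # A caption with fewer than n tokens contributes no n-grams.
        for start in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[start:start + n]))
    return seen


captions = [
    "a young woman with wavy hair",
    "a young woman with brown hair",
]
distinct = unique_ngrams(captions, n=4)
```

Template-generated captions tend to share many n-grams, so a corpus whose unique 4-gram count keeps growing as more data is sampled has more varied phrasing.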