15M Multimodal Facial Image-Text Dataset
Dawei Dai, YuTang Li, YingGe Liu, Mingming Jia, Zhang YuanHui, Guoyin Wang
TL;DR
FaceCaption-15M addresses the scarcity of large-scale, high-quality facial image-text data. The corpus contains 15 million aligned facial image-caption pairs built from LAION-Face: faces are cropped with RetinaFace, automatically annotated with 40 attributes, turned into sentences by grammar templates, and then rewritten by an LLM into natural captions. The authors also introduce FLIP, a CLIP-like multimodal model trained on FaceCaption-15M that couples a ViT-B/16 image encoder with a BERT-base text encoder and is optimized with image-text contrastive (ITC) and image-text matching (ITM) losses to align facial images with their captions. The authors' analyses indicate that FaceCaption-15M offers higher image quality, richer and more diverse captions, and stronger text-image relevance than several existing datasets, and FLIP-based models reach state-of-the-art performance on face-centered tasks. The work demonstrates the practical value of large-scale, high-quality facial image-text data for retrieval, recognition, and attribute prediction. Data, code, and models are openly available, and the authors acknowledge ethical considerations, including potential misuse in applications such as deepfakes.
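The template step described above can be illustrated with a minimal sketch. It assumes binary attribute predictions and a tiny hand-written grammar; the attribute names, phrases, and the function names `join_phrases` and `attributes_to_caption` are hypothetical and are not taken from the authors' pipeline, and the later LLM rewriting step is not shown.

```python
from typing import Dict, List


def join_phrases(items: List[str]) -> str:
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def attributes_to_caption(attrs: Dict[str, bool], subject: str = "This person") -> str:
    """Turn predicted binary attributes into one templated sentence."""
    # Hypothetical attribute names and phrases, for illustration only.
    is_phrases = {"smiling": "smiling", "young": "young"}
    has_phrases = {
        "black_hair": "black hair",
        "eyeglasses": "eyeglasses",
        "beard": "a beard",
    }
    # Keep only the phrases whose attribute was predicted as present.
    is_part = [p for key, p in is_phrases.items() if attrs.get(key)]
    has_part = [p for key, p in has_phrases.items() if attrs.get(key)]

    clauses = []
    if is_part:
        clauses.append("is " + join_phrases(is_part))
    if has_part:
        clauses.append("has " + join_phrases(has_part))
    if not clauses:
        return ""  # no positive attributes, so there is nothing to describe
    return subject + " " + " and ".join(clauses) + "."


caption = attributes_to_caption(
    {"smiling": True, "black_hair": True, "eyeglasses": True}
)
print(caption)
```

Sentences produced this way are grammatical but repetitive across images, which is the motivation for the subsequent LLM rewriting into more natural captions.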
Abstract
Image-text-driven multimodal deep learning models have demonstrated outstanding potential in many fields, and tasks centered on facial images have broad application prospects. This paper presents FaceCaption-15M, a large-scale, diverse, and high-quality dataset of facial images accompanied by natural language descriptions (facial image-to-text). The dataset aims to facilitate research on face-centered tasks. FaceCaption-15M comprises over 15 million pairs of facial images and corresponding natural language descriptions of facial features, making it the largest facial image-caption dataset to date. We conducted a comprehensive analysis of image quality, text naturalness, text complexity, and text-image relevance to demonstrate the superiority of FaceCaption-15M. To validate its effectiveness, we first trained a facial language-image pre-training model (FLIP, similar to CLIP) to align facial images with their corresponding captions in feature space. Subsequently, using both the image and text encoders and fine-tuning only the linear layer, our FLIP-based models achieved state-of-the-art results on two challenging face-centered tasks. Our goal in releasing FaceCaption-15M is to promote research on face-related tasks. All data, code, and models are publicly available at https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M
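Aligning images and captions in feature space, as CLIP-like models do, is commonly implemented with a symmetric contrastive objective. The following is a minimal pure-Python sketch of that objective, not the authors' training code: the function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits: List[Vector]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs: List[Vector], text_embs: List[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss.

    Pair i in image_embs is assumed to match pair i in text_embs;
    every other item in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    i2t = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    t2i = [list(col) for col in zip(*i2t)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diag(i2t) + cross_entropy_diag(t2i))


# Toy batch: matched pairs should score a much lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In practice the embeddings would come from the image and text encoders and the loss would be computed on tensors with automatic differentiation; the list-based version here only shows the arithmetic.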
