FaceGemma: Enhancing Image Captioning with Facial Attributes for Portrait Images

Naimul Haque, Iffat Labiba, Sadia Akter

TL;DR

FaceGemma targets portrait image captioning by integrating facial attributes into the description process. The method uses Llama 3 70B to generate attribute-rich descriptions for 2000 faces from FaceAttDB and then fine-tunes PaliGemma on them so that its captions align with nuanced facial features. The reported results are an average BLEU-1 score of 0.364 and a METEOR score of 0.355, which the authors present as evidence for the value of attribute-aware descriptions of portrait images. The work combines FaceAttDB with a prompt-driven pipeline intended to broaden the applicability of captioning in accessibility and multilingual contexts.

Abstract

Automated image caption generation is essential for improving the accessibility and understanding of visual content. In this study, we introduce FaceGemma, a model that accurately describes facial attributes such as emotions, expressions, and features. Using FaceAttDB data, we generated descriptions for 2000 faces with the Llama 3 70B model and fine-tuned the PaliGemma model with these descriptions. Based on the attributes and captions supplied in FaceAttDB, we created a new description dataset where each description perfectly depicts the human-annotated attributes, including key features like attractiveness, full lips, big nose, blond hair, brown hair, bushy eyebrows, eyeglasses, male, smile, and youth. This detailed approach ensures that the generated descriptions are closely aligned with the nuanced visual details present in the images. Our FaceGemma model leverages an innovative approach to image captioning by using annotated attributes, human-annotated captions, and prompt engineering to produce high-quality facial descriptions. Our method significantly improved caption quality, achieving an average BLEU-1 score of 0.364 and a METEOR score of 0.355. These metrics demonstrate the effectiveness of incorporating facial attributes into image captioning, providing more accurate and descriptive captions for portrait images.
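
To make the BLEU-1 figure above more concrete, the following is a minimal sketch of how a sentence-level BLEU-1 score can be computed: clipped unigram precision multiplied by a brevity penalty. It is an illustration only, not the authors' evaluation code. The function name `bleu1`, the whitespace tokenization, the lowercasing, and the single-reference setup are all assumptions made here; the paper's exact evaluation settings are not stated in this summary.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against a single reference (illustrative sketch).

    Assumptions: lowercase whitespace tokenization, one reference, no smoothing.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens:
        return 0.0

    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)

    # Clipped unigram matches: a candidate word is credited at most as many
    # times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(cand_tokens)

    # Brevity penalty: penalize candidates that are not longer than the reference.
    c, r = len(cand_tokens), len(ref_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)

    return brevity_penalty * precision


if __name__ == "__main__":
    # Hypothetical captions, not taken from FaceAttDB.
    generated = "a young man with brown hair and eyeglasses"
    human = "a smiling young man with brown hair and bushy eyebrows"
    print(round(bleu1(generated, human), 3))
```

A corpus-level score, such as the average the abstract reports, would aggregate a measure like this over all test captions.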

Paper Structure (14 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: A partial example of our FaceAttDB dataset, with the attributes highlighted in each caption.
  • Figure 2: A partial example of facial descriptions generated by our FaceGemma model, covering a wide range of facial attributes.
  • Figure 3: Methodology of FaceGemma. The Llama3 70B model generates detailed descriptions for 2000 faces in the FaceAttDB dataset, focusing on specific facial attributes. These descriptions form a new dataset, which, along with the corresponding images, is used to fine-tune the PaliGemma model. The fine-tuning process pairs each image with its corresponding description, enabling PaliGemma to learn to align its outputs with human-written descriptions. The final model, FaceGemma, is then evaluated for its effectiveness in accurately describing facial features in unseen images.
  • Figure 4: The inference process of the fine-tuned FaceGemma model. The test image is converted into soft tokens using the SigLIP visual feature extractor. Simultaneously, the prompt is tokenized into word tokens by Gemma. These image features and word tokens are concatenated and passed to the fine-tuned Gemma model to generate a descriptive response for the portrait image.
  • Figure 5: Training loss over time for the PaliGemma model, which is used to generate descriptions of portrait images. The x-axis shows the number of training steps and the y-axis shows the training loss, computed by comparing the model's generated descriptions with the ground-truth descriptions. A decreasing training loss indicates that the model fits the training descriptions more closely.
  • ...and 2 more figures
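
The inference flow described for Figure 4 (image soft tokens and prompt word tokens concatenated into one decoder input) can be sketched with a toy example. This is a schematic under stated assumptions, not the FaceGemma implementation: `embed_image_patches` and `embed_prompt` are hypothetical stand-ins for the SigLIP encoder and the Gemma token embedding, the vocabulary and embedding width are made up, and placing the image tokens before the text tokens is an assumption about ordering that the caption itself does not spell out.

```python
from typing import Dict, List

Vector = List[float]


def embed_image_patches(patches: List[List[float]], dim: int) -> List[Vector]:
    """Hypothetical stand-in for a vision encoder: one soft token per patch.

    Each patch is reduced to its mean intensity and broadcast to `dim` values.
    """
    return [[sum(patch) / len(patch)] * dim for patch in patches]


def embed_prompt(prompt: str, vocab: Dict[str, int], dim: int) -> List[Vector]:
    """Hypothetical stand-in for text tokenization plus embedding lookup."""
    token_ids = [vocab[word] for word in prompt.lower().split()]
    return [[float(token_id)] * dim for token_id in token_ids]


def build_decoder_input(image_tokens: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    """Concatenate image soft tokens and prompt tokens into a single sequence.

    Assumption: image tokens come first, followed by the prompt tokens.
    """
    return image_tokens + text_tokens


if __name__ == "__main__":
    dim = 4  # toy embedding width, chosen for illustration only
    patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three fake image patches
    vocab = {"describe": 1, "the": 2, "face": 3}  # made-up vocabulary

    image_tokens = embed_image_patches(patches, dim)
    text_tokens = embed_prompt("describe the face", vocab, dim)
    sequence = build_decoder_input(image_tokens, text_tokens)

    # The language model would consume this sequence and generate the caption.
    print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is only that both modalities end up in one sequence of equal-width vectors, which is what lets a single language model attend over image and prompt together.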