FaceInsight: A Multimodal Large Language Model for Face Perception

Jingzhi Li, Changjiang Luo, Ruoyu Chen, Hua Zhang, Wenqi Ren, Jianhou Gan, Xiaochun Cao

TL;DR

FaceInsight addresses the weakness of general-purpose multimodal LLMs in face perception by integrating a face segmentation modality and two knowledge-guided constraints. The approach feeds a multimodal LLM with features from a ViT-based image encoder, a segmentation encoder, and a text encoder. A Correlation Constraint Module builds a sparse adjacency matrix $\hat{C}$ from co-occurrences via $p_{ij}=\frac{m_{ij}}{n_i}$ and refines prompts through $T_p^l=\rho(\hat{C}T_p^{l-1}W^{l-1})$, while a Logical Constraint Module enforces deterministic relations through $\mathcal{L}_{c}$. The model is trained with $\mathcal{L}=\mathcal{L}_{\text{bce}}+\mathcal{L}_{c}$ while most parameters are frozen, and segmentation maps are integrated as $I_s$ to provide localized structure. Across three face-perception tasks (attributes, age/gender/race, expressions) and six datasets, FaceInsight substantially outperforms nine competing MLLMs in both training-free and fine-tuned settings, and qualitative visualizations show fewer hallucinations and more coherent facial descriptions. The work demonstrates that combining structured facial knowledge and region-aware visuals with LLM reasoning yields reliable, fine-grained facial analysis suitable for real-world applications.
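To make the two Correlation Constraint formulas concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that $m_{ij}$ counts training samples in which labels $i$ and $j$ both occur and that $n_i$ counts samples containing label $i$, which is the usual reading of this notation but is not spelled out in the summary. The threshold `tau` used to sparsify the matrix, the choice of ReLU for $\rho$, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def cooccurrence_probs(labels: List[List[int]]) -> Matrix:
    """Return p[i][j] = m_ij / n_i from binary label rows (one row per sample).

    Assumption: m_ij = number of samples with both labels i and j,
    n_i = number of samples with label i.
    """
    k = len(labels[0])
    n = [sum(row[i] for row in labels) for i in range(k)]
    p = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            m_ij = sum(row[i] * row[j] for row in labels)
            p[i][j] = m_ij / n[i] if n[i] else 0.0
    return p


def sparsify(p: Matrix, tau: float) -> Matrix:
    """Binarize p with a hypothetical threshold tau to get a sparse C-hat."""
    return [[1.0 if v >= tau else 0.0 for v in row] for row in p]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product, adequate for small illustrative inputs."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def propagate(c_hat: Matrix, t_prev: Matrix, w: Matrix) -> Matrix:
    """One refinement step T^l = rho(C-hat T^{l-1} W^{l-1}), with rho = ReLU (assumed)."""
    z = matmul(matmul(c_hat, t_prev), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy data: 4 samples, 3 binary labels (purely illustrative).
labels = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
p = cooccurrence_probs(labels)
c_hat = sparsify(p, tau=0.5)

# Toy prompt embeddings (3 labels x 2 dims) and an identity weight matrix.
t0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]
t1 = propagate(c_hat, t0, w0)
```

Note that $p_{ij}$ is generally asymmetric ($p_{ij} \neq p_{ji}$ when $n_i \neq n_j$), so the resulting graph is directed: each label's prompt embedding is updated from the labels that tend to accompany it.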

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, a versatile face perception MLLM that provides fine-grained facial information. Our approach introduces visual-textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching the visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments and analyses across three face perception tasks demonstrate that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.
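The training objective $\mathcal{L}=\mathcal{L}_{\text{bce}}+\mathcal{L}_{c}$ combines a standard binary cross-entropy term with a penalty for violating deterministic relationships. This summary does not give the exact form of $\mathcal{L}_{c}$, so the sketch below substitutes one simple possibility purely for illustration: a penalty on pairs of labels assumed to be mutually exclusive, computed as the product of their predicted probabilities. The function names and the `exclusive_pairs` argument are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def bce(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over independent label probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def exclusion_penalty(probs: Sequence[float],
                      exclusive_pairs: List[Tuple[int, int]]) -> float:
    """Illustrative stand-in for L_c: large when two labels assumed to be
    mutually exclusive are both predicted with high probability."""
    return sum(probs[i] * probs[j] for i, j in exclusive_pairs)


def total_loss(probs: Sequence[float], targets: Sequence[int],
               exclusive_pairs: List[Tuple[int, int]]) -> float:
    """L = L_bce + L_c, with L_c replaced by the assumed penalty above."""
    return bce(probs, targets) + exclusion_penalty(probs, exclusive_pairs)


# Labels 0 and 1 are assumed mutually exclusive in this toy setup.
pairs = [(0, 1)]
inconsistent = total_loss([0.9, 0.8, 0.1], [1, 0, 0], pairs)
consistent = total_loss([0.9, 0.05, 0.1], [1, 0, 0], pairs)
```

In this toy setup the prediction that assigns high probability to both exclusive labels receives the larger loss, which is the behavior a logical constraint term is meant to encourage.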

Paper Structure

This paper contains 17 sections, 10 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An overview of FaceInsight. To fully integrate fine-grained knowledge of facial information, we design a correlation constraint module and a logical constraint module to model uncertain dependencies and certain logical relationships, respectively. Additionally, we incorporate the face segmentation modality to provide region-level visual information, embedding spatial-aware facial visual knowledge into the framework.
  • Figure 2: Performance comparison of FaceInsight and nine MLLMs on the MAAD and CelebA datasets.
  • Figure 3: Performance comparison of FaceInsight and nine MLLMs on the FairFace dataset.
  • Figure 4: Performance comparison of FaceInsight and nine MLLMs on the UTKFace dataset.
  • Figure 5: Performance comparison of FaceInsight and nine MLLMs on the ExpW and RAF-DB datasets.
  • ...and 2 more figures