Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models

Yongsheng Yu, Jiebo Luo

TL;DR

The paper addresses demographic inference from images using large multimodal models (LMMs) and proposes an integrated benchmark across UTKFace, FairFace, and CACD. It introduces Chain-of-Thought augmented prompting to generate intermediate facial features and a name-based ethnicity cue, culminating in a refined demographic description that guides final predictions. Empirical results show that LMMs with CoT achieve strong zero-shot performance, reduced off-target predictions, and competitive accuracy relative to supervised baselines, with LLaVA achieving near-zero off-target rates. The work demonstrates the practical potential of interpretable, flexible LMMs for demographic inference in diverse, in-the-wild contexts, while highlighting remaining challenges in bias and misclassification under certain cues.

Abstract

Conventional demographic inference methods have predominantly operated under the supervision of accurately labeled data, yet struggle to adapt to shifting social landscapes and diverse cultural contexts, leading to narrow specialization and limited accuracy in applications. Recently, the emergence of large multimodal models (LMMs) has shown transformative potential across various research tasks, such as visual comprehension and description. In this study, we explore the application of LMMs to demographic inference and introduce a benchmark for both quantitative and qualitative evaluation. Our findings indicate that LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs, albeit with a propensity for off-target predictions. To enhance LMM performance and achieve comparability with supervised learning baselines, we propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
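To make the notion of an off-target prediction concrete, the following minimal sketch computes an off-target rate. It assumes that "off-target" means a model answer falling outside the task's allowed label set; the function name `off_target_rate`, the normalization rule, and the toy predictions are illustrative and do not come from the paper. The label list follows the UTKFace race categories named in the Figure 4 caption.

```python
def off_target_rate(predictions, allowed_labels):
    """Return the fraction of predictions that are not in the allowed label set.

    Assumption: a prediction is "off-target" when, after trimming whitespace
    and lowercasing, it does not match any allowed label exactly.
    """
    allowed = {label.strip().lower() for label in allowed_labels}
    if not predictions:
        return 0.0
    misses = sum(1 for p in predictions if p.strip().lower() not in allowed)
    return misses / len(predictions)


# Toy data for illustration only; these are not results from the paper.
race_labels = ["White", "Black", "Asian", "Indian", "Others"]
toy_predictions = ["Asian", "white", "Hispanic", "I cannot determine this", "Black"]

rate = off_target_rate(toy_predictions, race_labels)
print(f"off-target rate: {rate:.2f}")
```

Exact matching is a deliberately strict choice here: a free-form answer that embeds a valid label in a longer sentence would still count as off-target, so a real evaluation may need a more lenient parser.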

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Analysis of traditional Supervised Learning (SL) methods and naive LMMs on the demographic inference task.
  • Figure 2: Conceptual workflow of our Chain-of-Thought prompting approach for demographic inference. The process begins with task prompts that guide the LMM to articulate the facial features of the individual in the image, followed by name suggestions. The LMM then uses these attributes as a demographic description to deduce age, race, and gender, and provides post-hoc explanations for its conclusions.
  • Figure 3: Full prompt example of the Chain-of-Thought augmented prompting for demographic inference.
  • Figure 4: Qualitative comparison of naive LMMs and CoT-augmented LMMs. Incorrect answers are shown in red and correct ones in green. The 'Others' race category in UTKFace covers individuals who are not White, Black, Asian, or Indian. Zoom in for a better view.
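
The workflow described in the Figure 2 caption can be sketched as a chain of prompts, where each stage feeds its output into the next. This is only a rough illustration under stated assumptions: the `ask` callable is a hypothetical stand-in for whatever LMM client is used (it is assumed to attach the image itself), and the prompt wording is invented here. The paper's actual prompt is the one shown in its Figure 3, which is not reproduced in this summary.

```python
from typing import Callable, Dict


def cot_demographic_inference(ask: Callable[[str], str]) -> Dict[str, str]:
    """Run a four-stage Chain-of-Thought prompt sequence.

    `ask` is a hypothetical function that sends one text prompt (plus the
    image, handled by the caller) to an LMM and returns its text reply.
    All prompt strings below are illustrative, not the paper's wording.
    """
    # Stage 1: have the model articulate facial features.
    features = ask("Describe the facial features of the person in the image.")

    # Stage 2: ask for name suggestions conditioned on those features.
    names = ask(
        f"Given these facial features:\n{features}\n"
        "Suggest names that might fit this person."
    )

    # Stage 3: combine both into a demographic description and predict.
    description = f"Facial features: {features}\nSuggested names: {names}"
    prediction = ask(
        f"Using this demographic description:\n{description}\n"
        "State the person's age, race, and gender."
    )

    # Stage 4: request a post-hoc explanation of the prediction.
    explanation = ask(
        f"Description:\n{description}\nPrediction: {prediction}\n"
        "Explain which cues support this prediction."
    )

    return {
        "features": features,
        "names": names,
        "prediction": prediction,
        "explanation": explanation,
    }


# Stub model for demonstration: records each prompt and returns a placeholder.
calls = []


def stub_lmm(prompt: str) -> str:
    calls.append(prompt)
    return f"<reply {len(calls)}>"


result = cot_demographic_inference(stub_lmm)
print(result["prediction"])
```

Passing the model as a plain callable keeps the sketch independent of any particular LMM API; a real client would replace `stub_lmm` and supply the image alongside each prompt.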