Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models
Yongsheng Yu, Jiebo Luo
TL;DR
The paper addresses demographic inference from images using large multimodal models (LMMs) and proposes an integrated benchmark spanning UTKFace, FairFace, and CACD. It introduces Chain-of-Thought (CoT) augmented prompting, which generates intermediate facial features and a name-based ethnicity cue, culminating in a refined demographic description that guides the final prediction. Empirical results show that LMMs with CoT achieve strong zero-shot performance, fewer off-target predictions, and accuracy competitive with supervised baselines, with LLaVA reaching near-zero off-target rates. The work demonstrates the practical potential of interpretable, flexible LMMs for demographic inference in diverse, in-the-wild contexts, while highlighting remaining challenges in bias and misclassification under certain cues.
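The staged prompting summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function name `cot_demographic_inference`, the `name` argument, and the `model` callable (any function mapping a prompt and an image path to a text response) are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical model interface: (prompt, image_path) -> text response.
Model = Callable[[str, str], str]


def cot_demographic_inference(model: Model, image_path: str, name: str = "") -> Dict[str, str]:
    """Run a four-stage Chain-of-Thought style prompt chain (illustrative only)."""
    # Stage 1: ask for intermediate facial features.
    features = model(
        "Describe the visible facial features of the person in this image.",
        image_path,
    )

    # Stage 2: optional name-based cue. Where the name comes from is an
    # assumption here; the stage is skipped when no name is supplied.
    name_cue = ""
    if name:
        name_cue = model(
            f"Which ethnic background is the name '{name}' commonly associated with?",
            image_path,
        )

    # Stage 3: merge the intermediate outputs into a refined description.
    description = model(
        "Combine these observations into a concise demographic description.\n"
        f"Facial features: {features}\n"
        f"Name cue: {name_cue or 'not available'}",
        image_path,
    )

    # Stage 4: the refined description guides the final prediction.
    prediction = model(
        f"Description: {description}\n"
        "Based on this description and the image, state the person's "
        "age, gender, and ethnicity.",
        image_path,
    )

    return {
        "features": features,
        "name_cue": name_cue,
        "description": description,
        "prediction": prediction,
    }
```

Passing the model as a plain callable keeps the sketch independent of any particular LMM library; a wrapper around LLaVA or another model could be substituted without changing the chain.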
Abstract
Conventional demographic inference methods have predominantly operated under the supervision of accurately labeled data, yet struggle to adapt to shifting social landscapes and diverse cultural contexts, leading to narrow specialization and limited accuracy in applications. Recently, the emergence of large multimodal models (LMMs) has shown transformative potential across various research tasks, such as visual comprehension and description. In this study, we explore the application of LMMs to demographic inference and introduce a benchmark for both quantitative and qualitative evaluation. Our findings indicate that LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs, albeit with a propensity for off-target predictions. To enhance LMM performance and achieve comparability with supervised learning baselines, we propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
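To make the off-target prediction issue concrete, the sketch below computes an off-target rate under one plausible reading: a response counts as off-target when it falls outside the allowed label set. The abstract does not give a formal definition, so this interpretation, the function name `off_target_rate`, and the example labels are assumptions.

```python
from typing import Iterable


def off_target_rate(predictions: Iterable[str], allowed_labels: Iterable[str]) -> float:
    """Fraction of predictions that are not in the allowed label set.

    Assumes "off-target" means "outside the label set"; matching is
    case-insensitive and ignores surrounding whitespace.
    """
    allowed = {label.strip().lower() for label in allowed_labels}
    preds = [p.strip().lower() for p in predictions]
    if not preds:
        raise ValueError("predictions must not be empty")
    off_target = sum(1 for p in preds if p not in allowed)
    return off_target / len(preds)


# Hypothetical model outputs for a gender task with two allowed labels.
responses = ["Male", "female", "I cannot determine", "male "]
rate = off_target_rate(responses, ["male", "female"])
print(f"off-target rate: {rate:.2f}")
```

In this toy example one of four responses is a refusal rather than a label, so the rate is 0.25. A stricter or looser matching rule (for example, substring matching on free-form responses) would change the number, which is why the definition needs to be fixed before comparing models.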
