Generalized People Diversity: Learning a Human Perception-Aligned Diversity Representation for People Images

Hansa Srinivasan, Candice Schumann, Aradhana Sinha, David Madras, Gbolahan Oluwafemi Olanubi, Alex Beutel, Susanna Ricco, Jilin Chen

TL;DR

This work tackles the challenge of diversifying images of people in retrieval tasks beyond a fixed set of attributes. It introduces PATHS, a two-stage representation learning approach that first uses text-guided subspace extraction to focus on person-related diversity and then aligns this space with human perceptual judgments via triplet-based metric learning. PATHS is demonstrated to improve ranking diversity on both a narrow Occupations dataset and a broad Diverse People Dataset, outperforming attribute-based baselines while avoiding costly labels. The approach offers a scalable, human-aligned framework for diverse image retrieval with potential impact on fairness and representation in large-scale visual systems.

Abstract

Capturing the diversity of people in images is challenging: recent literature tends to focus on diversifying one or two attributes, requiring expensive attribute labels or building classifiers. We introduce a diverse people image ranking method which more flexibly aligns with human notions of people diversity in a less prescriptive, label-free manner. The Perception-Aligned Text-derived Human representation Space (PATHS) aims to capture all or many relevant features of people-related diversity, and, when used as the representation space in the standard Maximal Marginal Relevance (MMR) ranking algorithm, is better able to surface a range of types of people-related diversity (e.g. disability, cultural attire). PATHS is created in two stages. First, a text-guided approach is used to extract a person-diversity representation from a pre-trained image-text model. Then this representation is fine-tuned on perception judgments from human annotators so that it captures the aspects of people-related similarity that humans find most salient. Empirical results show that the PATHS method achieves diversity better than baseline methods, according to side-by-side ratings from human annotators.
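The abstract describes using a representation space inside the standard MMR ranking algorithm. The following is a minimal sketch of greedy MMR re-ranking, not the authors' implementation. The function name `mmr_rank`, the cosine-similarity choice, and the exact form of the trade-off weight `alpha` are illustrative assumptions, and the embeddings stand in for whatever representation space (such as PATHS) is plugged in.

```python
import numpy as np

def mmr_rank(relevance, embeddings, alpha=0.5, k=None):
    """Greedy MMR sketch (hypothetical helper, not from the paper).

    relevance:  per-item relevance scores, shape (n,)
    embeddings: per-item non-zero vectors in the chosen space, shape (n, d)
    alpha:      assumed trade-off weight; 0 ranks by relevance only,
                larger values penalize similarity to already-selected items
    """
    relevance = np.asarray(relevance, dtype=float)
    emb = np.asarray(embeddings, dtype=float)

    # Pairwise cosine similarity (assumes no all-zero embedding rows).
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T

    n = len(relevance)
    k = n if k is None else min(k, n)
    selected = []
    remaining = list(range(n))

    while len(selected) < k:
        if selected:
            # Similarity of each candidate to its closest selected item.
            max_sim = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            max_sim = np.zeros(len(remaining))
        scores = (1 - alpha) * relevance[remaining] - alpha * max_sim
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with relevance scores `[1.0, 0.9, 0.8]` where the first two items share the same embedding and the third is orthogonal to them, `alpha=0.5` moves the third item ahead of the near-duplicate, while `alpha=0.0` preserves the relevance order.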

Paper Structure (38 sections, 2 equations, 12 figures, 10 tables, 1 algorithm)

Figures (12)

  • Figure 1: An example of image ranking results for the query of "Bride" with our proposed PATHS method. This method promotes the images outlined in red: a bride in culturally Chinese attire, and a bride in a wheelchair.
  • Figure 2: A spectrum of people image diversification methods: from narrow (considers too few attributes of people diversity) to broad (considers too many attributes of visual diversity). At the top of this figure, we list where various attributes tend to fall on this spectrum, with attributes such as gender presentation being most prevalent in narrow settings due to their more frequent availability.
  • Figure 3: Across both datasets, PATHS achieves the best added diversity ($\uparrow$ = better). On the Occupations dataset, which penalizes overly broad diversification, it outperforms all other methods. On the Diverse People Dataset, which penalizes overly narrow diversification, PATHS, the text-derived space, and the base CoCa embedding (the broadest diversification method) outperform all other baselines. As a non-person-specific embedding that encompasses all types of visual diversity, CoCa adds the most general visual diversity (hence its high performance on the Diverse People Dataset, where all images are of people), but does not add people-specific diversity (hence its poor performance on the Occupations dataset, which contains many non-person images).
  • Figure 4: An example of image retrieval results for the query "bride" across three methods: the no-diversification baseline, the SkinTone + Gender Expression baseline, and PATHS. PATHS promotes two images outlined in red: a photo of a bride in traditionally Chinese cultural attire, and a photo of a disabled bride. The SkinTone + Gender Expression baseline promotes an image of a black bride surrounded by a wedding party. Here, the gender component of this baseline creates an odd artifact: it surfaces images that contain more men for the query "bride."
  • Figure 5: Full side-by-side (SxS) results for all methods against the undiversified set, on both datasets, for $\alpha \in \{0.3, 0.5, 0.7\}$. A "win" is where the diversification method's side was rated as more diverse, "neutral" is where both sides were rated equally diverse, and a "loss" is where the diversification method's side was rated as less diverse.
  • ...and 7 more figures