Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

Muhammad Saif Ullah Khan, Muhammad Ferjad Naeem, Federico Tombari, Luc Van Gool, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

An LLM-based pipeline generates contextual descriptions of human body poses in images using only auxiliary attributes, and these descriptions are shown to enable zero-shot human-centric classification with CLIP. The FocusCLIP framework is also introduced; it incorporates Subject-Focused Attention in CLIP for improved text-to-image alignment.

Abstract

We present a novel LLM-based pipeline for creating contextual descriptions of human body poses in images using only auxiliary attributes. This approach facilitates the creation of the MPII Pose Descriptions dataset, which includes natural language annotations for 17,367 images containing people engaged in 410 distinct activities. We demonstrate the effectiveness of our pose descriptions in enabling zero-shot human-centric classification using CLIP. Moreover, we introduce the FocusCLIP framework, which incorporates Subject-Focused Attention (SFA) in CLIP for improved text-to-image alignment. Our models were pretrained on the MPII Pose Descriptions dataset and their zero-shot performance was evaluated on five unseen datasets covering three tasks. FocusCLIP outperformed the baseline CLIP model, achieving an average accuracy increase of 8.61% (33.65% compared to CLIP's 25.04%). Notably, our approach yielded improvements of 3.98% in activity recognition, 14.78% in age classification, and 7.06% in emotion recognition. These results highlight the potential of integrating detailed pose descriptions and subject-level guidance into general pretraining frameworks for enhanced performance in downstream tasks.
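
To make the zero-shot classification step mentioned in the abstract concrete, here is a minimal sketch of how a CLIP-style model typically assigns a label: embed the image and one text prompt per class, then pick the class whose text embedding has the highest cosine similarity with the image embedding. The function name `zero_shot_classify` and the toy three-dimensional vectors are illustrative assumptions standing in for real encoder outputs; this is not the authors' evaluation code.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding is closest (cosine) to the image."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # one similarity score per class
    return labels[int(np.argmax(sims))], sims

# Toy stand-ins for encoder outputs (assumed values, not real CLIP features).
labels = ["running", "sitting", "dancing"]
text_embs = np.array([
    [1.0, 0.0, 0.0],  # embedding of a prompt for "running"
    [0.0, 1.0, 0.0],  # embedding of a prompt for "sitting"
    [0.0, 0.0, 1.0],  # embedding of a prompt for "dancing"
])
image_emb = np.array([0.2, 0.9, 0.1])

pred, sims = zero_shot_classify(image_emb, text_embs, labels)
print(pred)  # sitting
```

No task-specific training is involved at this stage: the class set is defined entirely by the text prompts, which is what allows evaluation on unseen datasets.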

Paper Structure

This paper contains 19 sections, 4 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: Our LLM pipeline creates grounded pose descriptions for images of people using only auxiliary attributes (activity labels and 2D keypoint coordinates) obtained from dataset annotations or extracted from the images using pretrained models.
  • Figure 1: Impact of activity labels. We manually labeled the keypoints on the statue and defined an activity name (a). Using the GPT-4 model, we compare the pose descriptions generated from activity labels and keypoint data (b) with those generated only from keypoint data (c). The additional contextual information included in the LLM response when using the activity label is bolded. These details are absent when the activity label is omitted.
  • Figure 2: FocusCLIP outperforms the baseline CLIP model on three zero-shot classification tasks (activity, age, emotion). Both models are pretrained on our MPII Pose Descriptions dataset.
  • Figure 2: Impact of personas. We manually defined an activity name and labeled keypoints for two people in a sample image (a) and compared the GPT-4 output using our prompt, which specifies a persona in the first sentence (b), with the output of a modified prompt omitting the LLM role definition (c). For easier comparison, we segment the LLM output into three parts: the first about the overall image, the second about the person on the left, and the third about the person on the right. The text segments describing body pose are bold, whereas the text segments drawing insights from the pose are italic. Incorrect or superfluous pose descriptions are bold-italic. When we ask the LLM to act as an expert pose analyzer (b), it makes fewer mistakes, uses more engaging language, and provides higher-quality insights about the pose. By contrast, when directly asked to describe the pose without a specified role (c), the LLM focuses on insignificant details (e.g., legs, which are not important to the activity), writes monotonous sentences, makes more mistakes, and does not provide useful insights about the interaction.
  • Figure 3: Sample Pose Description. We use our pipeline to generate pose descriptions for two famous artworks, the Statue of David and the Mona Lisa. The LLM was provided body keypoints obtained using an off-the-shelf pose estimation network and manual activity labels.
  • ...and 7 more figures
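
The captions above describe a prompt that opens with a persona sentence and supplies an activity label plus 2D keypoint coordinates. A minimal sketch of how such a prompt could be assembled follows. The template wording, the `build_pose_prompt` helper, the keypoint names, and the sample coordinates are all hypothetical illustrations; the authors' actual prompt is not reproduced here.

```python
def build_pose_prompt(keypoints, activity=None,
                      persona="You are an expert human pose analyzer."):
    """Assemble an LLM prompt from auxiliary attributes only (no pixels)."""
    lines = [persona]  # persona goes in the first sentence
    if activity:
        # The activity label is optional context for the LLM.
        lines.append(f"Activity: {activity}")
    lines.append("2D keypoints (name: x, y):")
    for name, (x, y) in keypoints.items():
        lines.append(f"- {name}: {x:.0f}, {y:.0f}")
    lines.append("Describe the person's body pose in natural language.")
    return "\n".join(lines)

# Hypothetical keypoints in pixel coordinates.
keypoints = {
    "head": (120, 40),
    "left_wrist": (80, 150),
    "right_wrist": (165, 148),
}

prompt = build_pose_prompt(keypoints, activity="playing the violin")
print(prompt)
```

Calling the helper with `activity=None` or with `persona=""` produces the kind of ablated prompts the two "Impact of ..." captions compare against, although the exact ablation prompts used in the paper may differ.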