ChatPose: Chatting about 3D Human Pose

Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black

TL;DR

ChatPose is a multimodal LLM that represents a 3D human pose as a dedicated <POSE> token: the embedding of that token is mapped to SMPL pose parameters $\theta$, which are decoded into a 3D mesh $M(\theta,\beta)$ with the shape fixed to $\beta=0$. This unifies classical pose estimation and pose generation in a single model and enables two reasoning-oriented tasks that draw on the LLM's world knowledge: speculative pose generation (SPG) and reasoning-based pose estimation (RPE). The work introduces benchmarks for both tasks and reports competitive results against multimodal baselines. The vision encoder is kept frozen, while the SMPL projection layer is trained and the LLM is fine-tuned with LoRA on text-to-pose, image-to-pose, and multi-modal instruction-following data. ChatPose shows zero-shot pose reasoning in dialog and robustness to occlusions, highlighting the potential for interactive 3D pose analysis. Suggested future extensions include video input, pose editing, and broader pose reasoning, pointing toward reasoning-driven 3D pose understanding with LLMs.

Abstract

We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, and 14 tables.

Figures (13)

  • Figure 1: We introduce ChatPose, a multimodal LLM designed for chatting about human pose that produces 3D human poses (SMPL pose parameters) upon user request. ChatPose features a specialized SMPL projection layer trained to convert language embeddings into 3D human pose parameters. Our demonstration includes conversations both without (left) and with (right) an image input. When a pose token is detected in the output, its embedding is used to estimate the SMPL pose parameters, from which the corresponding 3D body mesh is generated.
  • Figure 2: Method and Training Overview. Our model is composed of a multi-modal LLM (with vision encoder, vision projection layer, and LLM), a SMPL projection layer, and the parametric human body model, i.e., SMPL [smpl]. The multi-modal LLM processes text and image inputs (if provided) to generate textual responses. In the training phase, we train the SMPL projection layer and fine-tune the LLM, while keeping the other components frozen. The three data types used for the end-to-end training are: text-to-3D pose generation, image-to-pose estimation, and multi-modal instruction-following data. When an image is available, its information is used by the LLM to deduce an answer. If the user inquires about a SMPL pose, the LLM responds with a <POSE> token. The embedding related to this token is then used to predict the SMPL pose parameters, leading to the generation of a body mesh, as visualized.
  • Figure 3: Pose Generation. GPT-4 (DALL·E) [gpt4] generates images that depict the correct pose but does not explicitly generate 3D poses. PoseScript [posescript] is a task-specific method for generating 3D pose from language, but it is not able to relate high-level concepts like "searching under furniture" to 3D pose. In contrast, ChatPose understands high-level concepts and how to relate them to 3D pose. The methods in orange address SPG, while the green region indicates the "classical" approach. The first two query examples are sourced from our SPG benchmark, which offers implicit text queries regarding human poses. The third example is derived from the PoseScript test set, which has detailed descriptions of human poses.
  • Figure 4: We compare multi-modal LLMs (LLaVA [llava], GPT-4 [gpt4]) and traditional HMR-style methods (HMR2.0 [hmr2], SPIN [spin]) for classical human pose estimation. LLaVA* is LLaVA fine-tuned with keypoint data.
  • Figure 5: Comparison with LLaVA [llava] and classical HMR-style methods (HMR2.0 [hmr2] and SPIN [spin]) on reasoning-based human pose estimation. For each method, we use the entire image provided by the user as input, without applying cropping. Methods involving LLMs are highlighted in orange, while purely task-specific methods are marked in green.
  • ...and 8 more figures
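
The token-to-pose step described in the captions of Figures 1 and 2 can be illustrated with a minimal, hedged sketch. It is not the authors' implementation: the two-layer MLP, the toy embedding size, the random weights, and the helper names (`make_linear`, `smpl_projection`, `poses_from_output`) are all assumptions made for illustration. The output is assumed to be 24 axis-angle triples (SMPL's 23 body joints plus global orientation). Decoding the mesh $M(\theta,\beta)$ with $\beta=0$ would need an actual SMPL model and is left out.

```python
import math
import random

POSE_TOKEN = "<POSE>"
NUM_JOINTS = 24             # SMPL: 23 body joints + global orientation
POSE_DIM = NUM_JOINTS * 3   # assumed output format: axis-angle, 3 values per joint
EMBED_DIM = 16              # toy size; a real LLM hidden state is far larger


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Apply a dense layer to a flat list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def smpl_projection(embedding, layers):
    """Hypothetical two-layer MLP: token embedding -> flat pose vector."""
    hidden = [max(0.0, v) for v in linear(embedding, layers[0])]  # ReLU
    return linear(hidden, layers[1])


def poses_from_output(tokens, hidden_states, layers):
    """Project the hidden state at every <POSE> position to pose parameters."""
    poses = []
    for token, state in zip(tokens, hidden_states):
        if token == POSE_TOKEN:
            theta = smpl_projection(state, layers)
            # Reshape the flat vector into per-joint axis-angle triples.
            poses.append([theta[i:i + 3] for i in range(0, POSE_DIM, 3)])
    return poses


rng = random.Random(0)
layers = [make_linear(EMBED_DIM, 32, rng), make_linear(32, POSE_DIM, rng)]

# Stand-in for an LLM response and its last-layer hidden states.
tokens = ["Sure", ",", "the", "pose", "is", POSE_TOKEN, "."]
hidden_states = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in tokens]

poses = poses_from_output(tokens, hidden_states, layers)
print(len(poses), len(poses[0]), len(poses[0][0]))  # 1 24 3
```

Only positions holding the pose token are projected, so a purely textual reply yields an empty list and the model can still answer ordinary questions without producing a pose.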