UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

TL;DR

UniPose proposes a unified multimodal framework for human pose comprehension, generation, and editing by introducing a pose tokenizer that maps 3D SMPL poses to discrete tokens, a pose-aware visual processor, and a pose-aware Large Language Model with mixed-attention. This shared pose-text space enables end-to-end reasoning across images, text, and 3D poses through a four-stage training pipeline and an instruction-tuning regime, achieving competitive results on pose-to-text, pose-diff, image-to-text, image-diff, text-to-pose, pose generation, and pose editing tasks. The approach demonstrates strong cross-task transfer, zero-shot capabilities, and significant improvements when using a dual-encoder visual backbone and bidirectional pose token attention, while revealing the remaining gap to fully specialized pose estimators in certain scenarios. Overall, UniPose offers a first-of-its-kind, general-purpose framework for integrated pose understanding, generation, and editing with practical implications for multimodal human-pose applications.
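The mixed-attention idea mentioned above can be illustrated with a small mask builder. This is only a sketch under stated assumptions, not the authors' implementation: it assumes text tokens attend causally while tokens inside one contiguous run of pose tokens attend to each other in both directions. The function name `mixed_attention_mask` and the `is_pose` flag list are hypothetical names introduced for illustration.

```python
def mixed_attention_mask(is_pose):
    """Return mask[i][j] == True when position i may attend to position j.

    is_pose: list of bools, True where the token is a pose token.
    Assumption: each contiguous run of pose tokens forms one block that
    attends bidirectionally within itself; all other pairs are causal.
    """
    n = len(is_pose)

    # Give every contiguous pose run its own block id; text tokens get None.
    block = [None] * n
    current = -1
    for i, flag in enumerate(is_pose):
        if flag:
            if i == 0 or not is_pose[i - 1]:
                current += 1
            block[i] = current

    # Start from a standard causal (lower-triangular) mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open up full attention inside each pose block.
    for i in range(n):
        for j in range(n):
            if block[i] is not None and block[i] == block[j]:
                mask[i][j] = True
    return mask


# Sequence layout: text, text, pose, pose, pose, text
mask = mixed_attention_mask([False, False, True, True, True, False])
```

With this layout, the first pose token can see the later pose tokens in its block, while text tokens still see only earlier positions.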

Abstract

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance its fine-grained pose perception capabilities, we equip UniPose with a mixture of visual encoders, including a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
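As a rough illustration of how discrete pose tokens could share a vocabulary with text, the sketch below quantizes latent vectors to their nearest codebook entry and then shifts the resulting indices past the text vocabulary. Everything here is an assumption made for illustration: the pose encoder that would produce the latents is omitted, the toy 2-D codebook, the text vocabulary size of 32000, and the helper names `quantize` and `to_llm_ids` are hypothetical, and the abstract does not state that UniPose uses this exact scheme.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(z, codebook[k]))
        for z in latents
    ]


def to_llm_ids(pose_indices, text_vocab_size):
    """Shift pose codebook indices so they follow the text token ids."""
    return [text_vocab_size + k for k in pose_indices]


# Toy codebook with four 2-D entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Pretend these latents came from a pose encoder applied to SMPL parameters.
latents = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

indices = quantize(latents, codebook)
ids = to_llm_ids(indices, text_vocab_size=32000)
print(indices)  # [1, 2, 0]
print(ids)      # [32001, 32002, 32000]
```

Placing pose ids after the text ids is one simple way to keep both token types in a single embedding table; other layouts would work equally well.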

Paper Structure

This paper contains 24 sections, 4 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: UniPose can handle pose comprehension, generation and editing tasks under different instructions within a unified framework.
  • Figure 2: Method overview: UniPose comprises a Pose Tokenizer, a Visual Processor, and a pose-aware LLM. By combining Pose Tokens learned by the pose tokenizer, Visual Embeddings from the visual processor, and Text Tokens from the text tokenizer, UniPose enables joint modeling of pose comprehension, generation, and editing within a unified visual-language backbone.
  • Figure 2: Prompt to query GPT-4 for refining text in the ImageScript dataset.
  • Figure 3: The training paradigm of UniPose.
  • Figure 3: Prompt to query GPT-4 for refining text in the ImageDiff dataset.
  • ...and 4 more figures