UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Yiheng Li; Ruibing Hou; Hong Chang; Shiguang Shan; Xilin Chen

TL;DR

UniPose is a unified multimodal framework for human pose comprehension, generation, and editing. It combines three components: a pose tokenizer that maps 3D SMPL poses to discrete tokens, a pose-aware visual processor, and a pose-aware Large Language Model that uses mixed attention. The resulting shared pose-text space enables end-to-end reasoning across images, text, and 3D poses. The model is trained with a four-stage pipeline followed by instruction tuning, and achieves competitive results on pose-to-text, pose-diff, image-to-text, image-diff, text-to-pose, pose generation, and pose editing tasks. The approach shows strong cross-task transfer and zero-shot capabilities, and it benefits significantly from a dual-encoder visual backbone and bidirectional attention over pose tokens, although a gap to fully specialized pose estimators remains in certain scenarios. Overall, UniPose offers a first-of-its-kind, general-purpose framework for integrated pose understanding, generation, and editing, with practical implications for multimodal human-pose applications.
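
To make the idea of mixed attention with bidirectional pose token attention more concrete, the following is a minimal sketch of one way such a mask could be built. It is not taken from the paper: the function name `build_mixed_attention_mask`, the boolean `is_pose` input, and the rule that pose tokens attend bidirectionally only within the same contiguous pose span are all assumptions made for illustration. Text tokens keep ordinary causal attention.

```python
from typing import List


def build_mixed_attention_mask(is_pose: List[bool]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption (illustrative only): text tokens use causal attention, while
    pose tokens inside one contiguous pose span see each other in both
    directions.
    """
    n = len(is_pose)

    # Label each contiguous run of pose tokens with a span id; text gets -1.
    span_id = [-1] * n
    current = -1
    for i, flag in enumerate(is_pose):
        if flag:
            if i == 0 or not is_pose[i - 1]:
                current += 1  # a new pose span starts here
            span_id[i] = current

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_pose_span = span_id[i] != -1 and span_id[i] == span_id[j]
            mask[i][j] = causal or same_pose_span
    return mask


# Hypothetical sequence: one text token, two pose tokens, one text token.
example_mask = build_mixed_attention_mask([False, True, True, False])
```

In this sketch the first pose token can attend to the second one even though it comes later in the sequence, while text positions still cannot look ahead.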

Abstract

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance fine-grained pose perception, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
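
As a rough illustration of how a pose tokenizer can turn continuous pose features into discrete tokens in a unified vocabulary, here is a minimal sketch based on nearest-neighbor codebook lookup (vector quantization). The abstract does not specify the quantization scheme, so this choice is an assumption. The names `quantize` and `to_pose_tokens`, the `<pose_k>` token format, and the toy codebook are hypothetical, and the encoder that would produce the latent vectors from SMPL parameters is omitted.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    Assumption: squared Euclidean distance, as in common VQ-style tokenizers.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def to_pose_tokens(indices: Sequence[int]) -> List[str]:
    """Render codebook indices as hypothetical vocabulary entries."""
    return [f"<pose_{k}>" for k in indices]


# Toy 2-D codebook and latents; real pose latents would come from an encoder.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
toy_latents = [[0.1, -0.1], [1.9, 0.2], [0.9, 1.2]]
toy_tokens = to_pose_tokens(quantize(toy_latents, toy_codebook))
```

Once poses are expressed as tokens like these, they can be appended to the text vocabulary, so that a single language model can read and emit them alongside words.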
