PoseFix: Correcting 3D Human Poses with Natural Language

Ginger Delmas; Philippe Weinzaepfel; Francesc Moreno-Noguer; Grégory Rogez

TL;DR

PoseFix addresses the correction of 3D human poses from natural language feedback by introducing a dataset of over 135k pose pairs $(A,B)$ with textual modifiers. It develops two baselines: a text-based pose editing model built on a conditional variational autoencoder and a transformer-based correctional text generation model, each leveraging pose information and language cues. The study reports that pretraining on automatically generated modifiers and applying targeted data augmentations substantially improve both pose-editing quality and the coherence of generated feedback, with potential applications in animation, coaching, and robot teaching. Overall, PoseFix provides a scalable data collection pipeline and effective baselines, establishing a foundation for language-guided, fine-grained 3D pose modification.

Abstract

Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help with assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focuses on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback describing how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, which aims at generating corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses.

Paper Structure (16 sections, 2 equations, 14 figures, 6 tables)

Introduction
Related Work
The PoseFix dataset
Pair selection process
Collection of human annotations
Generating annotations automatically
Statistics and semantic analysis
Application to Text-based Pose Editing
Application to correctional text generation
Conclusion
PoseFix complementary information
Human annotations
Automatic annotations
Original triplets of the generation examples
Miscellaneous visualizations
...and 1 more section

Figures (14)

Figure 1: Illustration of the tasks addressed with the new PoseFix dataset, which consists of textual descriptions of the difference between two 3D body poses.
Figure 2: Examples of pose pairs and their annotated modifier in PoseFix. The source pose is shown in gray and the target pose in purple. Poses from in-sequence (IS) pairs come from the same motion clip, unlike those from out-of-sequence (OOS) pairs.
Figure 3: Left: Data presented to the annotators. The slider makes it possible to look at the poses under different viewpoints. Right: word cloud of the PoseFix annotations.
Figure 4: Overview of our text-based pose editing baseline. The top part represents a standard VAE, where poses are encoded into a Gaussian distribution. At training time, a latent variable is sampled and decoded into a pose to learn pose reconstruction. The bottom left part represents the conditioning: the text is encoded using a frozen DistilBERT with a small transformer on top. It is combined with source pose features in the fusion module, from which we predict a Gaussian distribution. A KL loss ensures the alignment of the distributions from the standard VAE and the conditioning. At test time, we sample from the latter to predict the target pose.
Figure 5: Generated poses for the text-based pose editing task on PoseFix queries from the left blocks. Two views of each pose are shown on the same ground plane to better convey the 3D structure. Generated poses are shown in blue. Original poses B from the PoseFix dataset are in the supplementary material.
...and 9 more figures
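
The Figure 4 caption describes two Gaussian distributions, one from the pose VAE encoder and one from the conditioning branch (text plus source pose), aligned with a KL loss, with the latent sampled from the former at training time and from the latter at test time. The NumPy sketch below illustrates only that mechanism. It is not the authors' implementation: the latent size, the diagonal-covariance parameterization, the direction of the KL term, and the random arrays standing in for encoder and fusion outputs are all assumptions made for illustration.

```python
import numpy as np


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


rng = np.random.default_rng(0)
batch_size, latent_dim = 4, 32  # hypothetical sizes, not taken from the paper

# Stand-ins for the pose encoder output (the "standard VAE" branch).
mu_pose = rng.normal(size=(batch_size, latent_dim))
logvar_pose = np.zeros((batch_size, latent_dim))

# Stand-ins for the fusion module output (text + source pose conditioning).
mu_cond = rng.normal(size=(batch_size, latent_dim))
logvar_cond = np.zeros((batch_size, latent_dim))

# Alignment term; the KL direction used here is an assumption.
kl = kl_diag_gaussians(mu_pose, logvar_pose, mu_cond, logvar_cond)

# Training: sample from the pose branch. Test: sample from the conditioning branch.
z_train = sample_latent(mu_pose, logvar_pose, rng)
z_test = sample_latent(mu_cond, logvar_cond, rng)
```

In a full model, `z_train` or `z_test` would be passed to a pose decoder; that decoder, the text encoder, and the fusion module are omitted here because the caption does not specify them in enough detail to reproduce.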
