
BodyShapeGPT: SMPL Body Shape Manipulation with LLMs

Baldomero R. Árbol, Dan Casas

TL;DR

This work converts natural-language body descriptions into 3D human shapes by mapping text to SMPL-X shape parameters. It fine-tunes an LLM (Llama-3 8B) with LoRA and 4-bit quantization on a dataset of $18{,}000$ training samples and $2{,}000$ evaluation cases, using a composite loss $\mathcal{L} = \mathcal{L}_\text{LLM} + \mathcal{L}_\text{shape} + \mathcal{L}_\text{measurements}$ to predict the 10D shape vector $\boldsymbol{\beta} \in \mathbb{R}^{10}$. Quantitatively, the fine-tuned model is more accurate than the baselines on body measurements and BMI distributions; qualitatively, varied prompts yield robust and diverse avatars. The result is fast, text-driven avatar creation for storytelling and virtual environments, with body shape controlled directly from natural language.
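
To make the structure of the composite loss concrete, here is a minimal sketch of how three such terms could be summed. The summary only gives the sum $\mathcal{L}_\text{LLM} + \mathcal{L}_\text{shape} + \mathcal{L}_\text{measurements}$; the specific choices below (token-level cross-entropy for the language term, mean squared error for the shape and measurement terms, unit weights) and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy (assumed form of the LLM term)."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary dimension.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)


def mse(pred, target):
    """Mean squared error (assumed form of the shape and measurement terms)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(logits, target_ids, beta_pred, beta_true, meas_pred, meas_true):
    """Unweighted sum of the three terms, mirroring L_LLM + L_shape + L_measurements."""
    l_llm = cross_entropy(logits, target_ids)
    l_shape = mse(beta_pred, beta_true)        # 10D shape vectors (beta)
    l_measurements = mse(meas_pred, meas_true)  # body measurements derived from the shape
    return l_llm + l_shape + l_measurements
```

With perfectly predicted shape and measurements, the two regression terms vanish and only the language-modeling term remains; any error in $\boldsymbol{\beta}$ or in the derived measurements adds directly to the total.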

Abstract

Generative AI models provide a wide range of tools capable of performing complex tasks in a fraction of the time it would take a human. Among these, Large Language Models (LLMs) stand out for their ability to generate diverse texts, from literary narratives to specialized responses in different fields of knowledge. This paper explores the use of fine-tuned LLMs to identify physical descriptions of people, and subsequently create accurate representations of avatars using the SMPL-X model by inferring shape parameters. We demonstrate that LLMs can be trained to understand and manipulate the shape space of SMPL, allowing the control of 3D human shapes through natural language. This approach promises to improve human-machine interaction and opens new avenues for customization and simulation in virtual environments.

Paper Structure (12 sections, 1 equation, 4 figures, 1 table)


Figures (4)

  • Figure 1: Example of dataset entries that relate SMPL-X shape parameters to their corresponding verbal descriptions
  • Figure 2: Cross Entropy Loss of the validation set for each training iteration.
  • Figure 3: Accuracy of the baseline model (a) and our model (b), evaluated using the Body Mass Index (BMI) computed from the regressed avatars. The ground truth range for each category is shown as horizontal solid boxes. Using our model, most of the test samples (i.e., colored dots) fall into the correct range.
  • Figure 4: Generated avatar with different prompts
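
The BMI-based evaluation in Figure 3 checks whether each regressed avatar falls into the range of its ground-truth category. A minimal sketch of that kind of check follows. It assumes the standard WHO adult cut-offs (18.5, 25, 30) and category names, which may differ from the ranges used in the paper, and it takes weight and height as given inputs, since this summary does not describe how they are derived from the avatar mesh. The function names are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Map a BMI value to a category using WHO adult cut-offs (an assumption here)."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"


def in_correct_range(weight_kg, height_m, expected_category):
    """True if the avatar's BMI falls in the range of its ground-truth category."""
    return bmi_category(bmi(weight_kg, height_m)) == expected_category
```

For example, `in_correct_range(70.0, 1.75, "normal")` evaluates the check for an avatar estimated at 70 kg and 1.75 m; the fraction of test samples for which the check holds gives a simple per-category accuracy.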