Text me the data: Generating Ground Pressure Sequence from Textual Descriptions for HAR

Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Lars Krupp, Vitor Fortes Rey, Paul Lukowicz

TL;DR

High-quality ground pressure data for HAR is costly to collect. The authors present Text-to-Pressure (T2P), a framework that discretizes pressure maps with a VQ-VAE and uses a text-conditioned autoregressive transformer to generate pressure sequences from textual activity descriptions, leveraging CLIP embeddings and a multi-stage data synthesis pipeline (GPT-4, T2M-GPT, SinMDM, joint2SMPL, PresSim) to produce 86,400 frames from 240 descriptions. Key contributions include a scalable text-to-pressure data generation method, a large synthetic dataset, and validation on real pressure data showing that combining synthetic and real data improves HAR macro F1 by about $5.9\%$; the approach achieves strong text-pressure consistency with $R^2=0.722$, $R^2_{bin}=0.892$, and $FID=1.83$. The work demonstrates the practicality of generating sensor data from textual descriptions and suggests applicability to other sensor modalities, enabling more robust HAR without exclusive reliance on costly ground-truth collection.

Abstract

In human activity recognition (HAR), the availability of substantial ground truth is necessary for training efficient models. However, acquiring ground pressure data through physical sensors can itself be cost-prohibitive and time-consuming. To address this critical need, we introduce Text-to-Pressure (T2P), a framework designed to generate extensive ground pressure sequences from textual descriptions of human activities using deep learning techniques. We show that combining vector quantization of sensor data with a simple text-conditioned autoregressive strategy allows us to obtain high-quality generated pressure sequences from textual descriptions, with the help of the discrete latent correlation between text and pressure maps. We achieved comparable performance on the consistency between text and generated motion, with an R squared value of 0.722, a masked R squared value of 0.892, and an FID score of 1.83. Additionally, we trained a HAR model with the synthesized data and evaluated it on pressure dynamics collected by a real pressure sensor; its performance is on par with a model trained on only real data. Combining both real and synthesized training data increases the overall macro F1 score by 5.9 percent.
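For readers unfamiliar with the macro F1 metric quoted above, the following is a minimal sketch of how it is commonly computed: the unweighted mean of per-class F1 scores. The function name `macro_f1` and the activity labels are illustrative assumptions, not taken from the paper, and this is not the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the truth but was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["walk", "walk", "sit", "sit", "squat", "squat"]
y_pred = ["walk", "sit", "sit", "sit", "squat", "walk"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, macro F1 does not let frequent activities mask poor performance on rare ones, which is why it is a common choice for HAR benchmarks with imbalanced classes.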

Paper Structure (12 sections, 4 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Data generation pipeline that combines GPT-4, T2M-GPT, SinMDM, joint2SMPL, and PresSim to generate pressure map sequences from activity descriptions for training T2P.
  • Figure 2: Illustration of the T2P architecture, which uses a vector-quantized variational auto-encoder (VQ-VAE) and an auto-regressive transformer (T2P) module to train the model.
  • Figure 3: Visualization of T2P results for two activities, "walk back" and "sit".
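
The vector-quantization step that a VQ-VAE such as the one in Figure 2 relies on can be sketched as a nearest-neighbour lookup in a codebook: each continuous latent vector is replaced by the index of its closest code, and the resulting token sequence is what an autoregressive transformer can model. The snippet below is a minimal pure-Python illustration under assumed names (`quantize`, `dequantize`) and a toy two-dimensional codebook; the paper's actual encoder, codebook size, and training losses are not reproduced here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices, codebook):
    """Look up the codebook vectors for a sequence of discrete tokens."""
    return [codebook[k] for k in indices]


# Toy codebook and latents (assumed values for illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.2], [0.2, 0.7]]

tokens = quantize(latents, codebook)          # discrete token per latent
reconstructed = dequantize(tokens, codebook)  # quantized latents for a decoder
```

In a full model the latents would come from a learned encoder over pressure maps and the quantized vectors would be passed to a decoder; only the discretization step is shown here.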