Text me the data: Generating Ground Pressure Sequence from Textual Descriptions for HAR
Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Lars Krupp, Vitor Fortes Rey, Paul Lukowicz
TL;DR
High-quality ground pressure data for HAR is costly to collect. The authors present Text-to-Pressure (T2P), a framework that discretizes pressure maps with a VQ-VAE and uses a text-conditioned autoregressive transformer to generate pressure sequences from textual activity descriptions, leveraging CLIP embeddings and a multi-stage data synthesis pipeline (GPT-4, T2M-GPT, SinMDM, joint2SMPL, PresSim) to produce 86,400 frames from 240 descriptions. Key contributions include a scalable text-to-pressure data generation method, a large synthetic dataset, and validation on real pressure data showing that combining synthetic and real data improves HAR macro F1 by about $5.9\%$; the approach achieves strong text-pressure consistency with $R^2=0.722$, $R^2_{bin}=0.892$, and $FID=1.83$. The work demonstrates the practicality of generating sensor data from textual descriptions and suggests applicability to other sensor modalities, enabling more robust HAR without exclusive reliance on costly ground-truth collection.
Abstract
In human activity recognition (HAR), substantial ground truth is necessary for training efficient models. However, acquiring ground pressure data through physical sensors can be cost-prohibitive and time-consuming. To address this need, we introduce Text-to-Pressure (T2P), a framework designed to generate extensive ground pressure sequences from textual descriptions of human activities using deep learning techniques. We show that combining vector quantization of sensor data with a simple text-conditioned autoregressive strategy yields high-quality pressure sequences from textual descriptions by exploiting the discrete latent correlation between text and pressure maps. We achieved comparable performance on the consistency between text and generated motion, with an R squared value of 0.722, a Masked R squared value of 0.892, and an FID score of 1.83. Additionally, we trained a HAR model with the synthesized data and evaluated it on pressure dynamics collected by a real pressure sensor; its performance is on par with that of a model trained only on real data. Combining both real and synthesized training data increases the overall macro F1 score by 5.9 percent.
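To make the vector quantization step concrete, the following is a minimal sketch of nearest-codebook lookup, the operation that turns continuous latent vectors into discrete tokens an autoregressive model can predict. It is an illustration only, not the authors' implementation: the function names `squared_distance` and `quantize`, the two-dimensional toy codebook, and the sample latents are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent vector to the index of its nearest codebook entry.

    Returns the token indices and the corresponding quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Pick the codebook entry with the smallest squared distance to z.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical toy data: a 3-entry codebook and 4 latent vectors,
# standing in for encoder outputs of consecutive pressure frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9], [0.6, 0.6]]

indices, quantized = quantize(latents, codebook)
print(indices)
```

In a setup of this kind, the resulting index sequence plays the role of a token sequence: a text-conditioned autoregressive model would be trained to predict the next index given the text embedding and the previous indices, and a decoder would map predicted indices back to pressure maps.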
