Human Pose Estimation from Ambiguous Pressure Recordings with Spatio-temporal Masked Transformers

Vandad Davoodnia, Ali Etemad

TL;DR

This paper tackles 3D human pose estimation from pressure-based tactile data, targeting settings where vision-based methods fail under occlusion or raise privacy concerns. It introduces a spatio-temporal transformer architecture that extends ViTPose with 2+1D convolutions and an encoder-decoder framework, and it uses masked auto-encoder pre-training to learn robust representations from ambiguous pressure maps. The method combines temporal crops with self-supervised pre-training to improve pose estimation, achieving state-of-the-art results on the Intelligent Carpet and SLP datasets. The findings demonstrate the value of temporal context and self-supervised learning (SSL) for tactile sensing, with practical implications for privacy-preserving pose estimation in real-world settings.

Abstract

Despite the impressive performance of vision-based pose estimators, they generally fail to perform well under adverse vision conditions and often do not satisfy the privacy demands of customers. As a result, researchers have begun to study tactile sensing systems as an alternative. However, these systems suffer from noisy and ambiguous recordings. To tackle this problem, we propose a novel solution for pose estimation from ambiguous pressure data. Our method comprises a spatio-temporal vision transformer with an encoder-decoder architecture. Detailed experiments on two popular public datasets reveal that our model outperforms existing solutions in the area. Moreover, we observe that increasing the number of temporal crops in the early stages of the network improves performance, and that pre-training the network in a self-supervised setting using a masked auto-encoder approach improves the results further.
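
The masked auto-encoder pre-training mentioned in the abstract can be pictured with a small sketch: a short clip of pressure maps is cut into patch tokens, and a random subset of those tokens is hidden so that a network has to reconstruct them from the visible ones. The code below is only an illustration under stated assumptions, not the authors' implementation. The function names `patchify` and `random_mask`, the 32x32 frame size, the 4-frame clip, the patch size of 8, and the 0.75 mask ratio are all hypothetical choices made for the example.

```python
import random


def patchify(clip, patch):
    """Cut a clip (list of T frames, each an H x W nested list) into flat patch tokens.

    Tokens are ordered frame by frame, then row-major over the patch grid.
    """
    tokens = []
    for frame in clip:
        height, width = len(frame), len(frame[0])
        if height % patch or width % patch:
            raise ValueError("frame size must be divisible by the patch size")
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                tokens.append([frame[r][c]
                               for r in range(top, top + patch)
                               for c in range(left, left + patch)])
    return tokens


def random_mask(num_tokens, mask_ratio, rng):
    """Pick which token indices stay visible and which are hidden.

    mask_ratio is an assumed hyperparameter; the source text does not state one.
    """
    num_keep = max(1, round(num_tokens * (1 - mask_ratio)))
    keep = sorted(rng.sample(range(num_tokens), num_keep))
    kept = set(keep)
    masked = [i for i in range(num_tokens) if i not in kept]
    return keep, masked


# Hypothetical clip: 4 frames of 32x32 synthetic pressure readings.
rng = random.Random(0)
clip = [[[rng.random() for _ in range(32)] for _ in range(32)] for _ in range(4)]

tokens = patchify(clip, patch=8)
keep, masked = random_mask(len(tokens), mask_ratio=0.75, rng=rng)

visible_tokens = [tokens[i] for i in keep]      # what an encoder would see
target_tokens = [tokens[i] for i in masked]     # what a decoder would reconstruct
```

In a setup like this, only `visible_tokens` would be passed to the encoder, and the reconstruction loss would be computed on `target_tokens`. Because tokens from all frames share one index space, the masking spans both space and time.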

Paper Structure (9 sections, 1 equation, 2 figures, 4 tables)

This paper contains 9 sections, 1 equation, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: An overview of our model. (a) shows the masked auto-encoder, and (b) shows our pose estimation network.
  • Figure 2: A comparison of our results and those of the 3D-CNN baseline [luo2021intelligent].