Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation

Daniel Sungho Jung, Kyoung Mu Lee

TL;DR

Dense foot contact estimation from a single image is hindered by shoe appearance variability and ambiguous ground cues. FECO addresses this by combining shoe style-invariant learning (via shoe style-content randomization and external shoe data) with ground-aware representations (pixel height maps and ground normals) and a Transformer-based decoder for dense, pixel-level contacts. Key contributions include the FECO framework, the dual randomization strategy, explicit ground-geometry supervision, and the COFE dataset, with state-of-the-art performance on MMVP and strong cross-dataset generalization. This work enables more robust interpretation of foot-ground interactions in monocular imagery, with potential benefits for sports analytics, rehabilitation, and AR/VR applications.

Abstract

Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. Learning dense foot contact estimation poses two main challenges. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, the ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.
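To make the zero-velocity constraint mentioned in the abstract concrete, the following is a minimal sketch of that kind of joint-level heuristic, not code from the paper. The function name, the frame rate, and the 0.1 m/s speed threshold are all illustrative assumptions. Note that it yields one boolean per joint per frame, which is the coarse, joint-level signal the abstract contrasts with dense contact.

```python
from math import dist


def zero_velocity_contact(positions, fps=30.0, speed_thresh=0.1):
    """Label each frame as in contact when the joint's speed is below a threshold.

    positions: list of (x, y, z) tuples in metres for ONE foot joint, one per frame.
    fps and speed_thresh (m/s) are hypothetical values chosen for illustration.
    """
    n = len(positions)
    if n < 2:
        # No motion information available: treat the joint as stationary.
        return [True] * n
    # Backward finite difference between consecutive frames, scaled to m/s.
    speeds = [dist(positions[i], positions[i - 1]) * fps for i in range(1, n)]
    # The first frame has no predecessor, so it reuses the first measured speed.
    speeds = [speeds[0]] + speeds
    return [s < speed_thresh for s in speeds]


# A heel that rests for three frames and then lifts off.
heel_track = [
    (0.000, 0.0, 0.00),
    (0.000, 0.0, 0.00),
    (0.001, 0.0, 0.00),
    (0.050, 0.0, 0.02),
    (0.100, 0.0, 0.04),
]
labels = zero_velocity_contact(heel_track)
print(labels)
```

A heuristic like this needs a motion sequence and says nothing about which part of the sole touches the ground, which is why a dense, single-image formulation is a different problem.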

Paper Structure

This paper contains 27 sections, 17 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: Overall pipeline of FECO. Our method first applies low-level style randomization to the input image and encodes it into an image feature using a ViT backbone. Shoe style and shoe content randomization are then performed on the image feature with random shoe images from the UT Zappos50K [yu2014fine] dataset to produce a shoe style-invariant feature. This feature is processed by a ground feature encoder to extract a ground feature, which is used to predict a pixel height map and a ground normal. Finally, the ground feature and the shoe style-invariant feature are fused to form a contact feature, which is decoded to produce the final foot contact prediction.
  • Figure 2: COFE Dataset. We manually annotate joint-level foot contact for samples in the OpenPose [cao2019openpose], InstaVariety [kanazawa2019learning], PennAction [zhang2013actemes], and MPII [andriluka20142d] datasets. In the visualization, black indicates contacting joints and white indicates non-contacting joints.
  • Figure 3: Qualitative comparison of dense foot contact estimation with POSA [hassan2021populating], BSTRO [huang2022capturing], and DECO [tripathi2023deco] on the MOYO [tripathi20233d], RICH [huang2022capturing], and Hi4D [yin2023hi4d] datasets. Red circles mark example regions where FECO outperforms previous methods.
  • Figure S1: COFE dataset statistics. We visualize the composition of our proposed COFE dataset, which consists of foot image samples from OpenPose [cao2019openpose], InstaVariety [kanazawa2019learning], PennAction [zhang2013actemes], and MPII [andriluka20142d]. Only training samples are included.
  • Figure S2: Contact and non-contact distribution of COFE dataset.
  • ...and 6 more figures
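
The Figure 1 caption describes shoe style randomization on the image feature using random shoe images, but this page does not give the exact operation. One common way to randomize feature style is to swap channel-wise mean and standard deviation, as in adaptive instance normalization. The sketch below illustrates that generic technique under this assumption. It is not the authors' confirmed implementation, and the function name `swap_style_stats` and the toy values are hypothetical.

```python
from statistics import fmean, pstdev


def swap_style_stats(content, style, eps=1e-5):
    """Re-normalise each content channel to the mean/std of the matching style channel.

    content, style: lists of channels, each channel a flat list of floats.
    This is generic AdaIN-style statistic swapping, used here as an assumed
    stand-in for the shoe style randomization named in the Figure 1 caption.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mu, c_sigma = fmean(c_chan), pstdev(c_chan)
        s_mu, s_sigma = fmean(s_chan), pstdev(s_chan)
        # Whiten with the content statistics, then re-colour with the style statistics.
        out.append([(v - c_mu) / (c_sigma + eps) * s_sigma + s_mu for v in c_chan])
    return out


# One toy channel from the input-image feature and one from a random shoe image.
content_feat = [[1.0, 2.0, 3.0, 4.0]]
style_feat = [[10.0, 10.0, 30.0, 30.0]]
mixed = swap_style_stats(content_feat, style_feat)
print(round(fmean(mixed[0]), 3), round(pstdev(mixed[0]), 3))
```

The relative ordering of values within each channel is preserved while the channel statistics follow the style source, so a downstream predictor trained on such features cannot rely on those statistics alone.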