Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
Daniel Sungho Jung, Kyoung Mu Lee
TL;DR
Dense foot contact estimation from a single image is hindered by shoe appearance variability and ambiguous ground cues. FECO addresses this by combining shoe style-invariant learning (via shoe style-content randomization and external shoe data) with ground-aware representations (pixel height maps and ground normals) and a Transformer-based decoder for dense, pixel-level contacts. Key contributions include the FECO framework, the dual randomization strategy, explicit ground-geometry supervision, and the COFE dataset, with state-of-the-art performance on MMVP and strong cross-dataset generalization. This work enables more robust interpretation of foot-ground interactions in monocular imagery, with potential benefits for sports analytics, rehabilitation, and AR/VR applications.
Abstract
Foot contact plays a critical role in human interaction with the world, so studying it can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact with a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. Learning dense foot contact estimation poses two main challenges. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across styles. Second, the ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present FEet COntact estimation (FECO), a framework that learns dense foot contact through shoe style-invariant and ground-aware learning. To overcome shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To make effective use of ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, the proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.
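For readers unfamiliar with the zero-velocity constraint mentioned above, the following is a minimal sketch of such a joint-level heuristic, not the method proposed here. The function name, the z-up coordinate convention in metres, the frame rate, and both thresholds are assumptions chosen for illustration only.

```python
from math import dist


def zero_velocity_contact(positions, fps=30.0, vel_thresh=0.15, height_thresh=0.05):
    """Label one foot joint as in contact per frame (illustrative heuristic).

    positions: list of (x, y, z) joint positions, one per frame, z pointing up.
    fps, vel_thresh (m/s) and height_thresh (m) are hypothetical values.
    """
    labels = []
    for t, p in enumerate(positions):
        # Use the previous frame; for the first frame fall back to the next one.
        neighbor = positions[t - 1] if t > 0 else positions[min(1, len(positions) - 1)]
        speed = dist(p, neighbor) * fps  # finite-difference speed in m/s
        # Contact is assumed when the joint is nearly static and close to the ground.
        labels.append(speed < vel_thresh and p[2] < height_thresh)
    return labels


# Hypothetical ankle trajectory: planted for three frames, then lifted forward.
ankle = [(0.0, 0.0, 0.02), (0.0, 0.0, 0.02), (0.0, 0.0, 0.02),
         (0.05, 0.0, 0.10), (0.10, 0.0, 0.12)]
print(zero_velocity_contact(ankle))
```

The heuristic returns a single boolean per joint per frame, so it cannot indicate which region of the sole touches the ground; this is the gap that dense foot contact estimation targets.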
