FIND: An Unsupervised Implicit 3D Model of Articulated Human Feet

Oliver Boyne, James Charles, Roberto Cipolla

TL;DR

This work tackles high-fidelity 3D foot reconstruction under limited supervision by introducing FIND, an implicit neural deformation field that deforms a template foot mesh using disentangled latent codes for shape, texture, and pose (each of length $100$). It combines differentiable rendering and an unsupervised part-based loss derived from StyleGAN2 Restyle features to learn 2D–3D part semantics and imposes a 3D surface constraint on per-vertex part probabilities, enabling flexible, multi-resolution mesh outputs. Foot3D, a high-resolution textured foot dataset, supports training and evaluation, with results showing that FIND outperforms a PCA-based baseline in shape quality and part correspondences, and that the unsupervised part-based loss improves image-based fitting, especially with few views. The approach has practical implications for home health, orthotics design, and online footwear by enabling accurate, interpretable foot models on resource-constrained devices.
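
To make the deformation-field idea concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation. The two-layer network, the hidden width, the sigmoid on the colour output, and the names `init_mlp` and `deform` are all hypothetical. Only the inputs (a 3D point on the template plus shape, pose and texture codes of length 100) and the outputs (a per-vertex displacement and colour) come from the text.

```python
import numpy as np

LATENT_DIM = 100  # length of each latent code, as stated in the TL;DR


def init_mlp(rng, in_dim, hidden_dim=128, out_dim=6):
    """Random weights for a small two-layer MLP (sizes are illustrative assumptions)."""
    return {
        "W1": rng.normal(0.0, 0.01, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.01, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def deform(params, template_vertices, shape_code, pose_code, texture_code):
    """Map template vertices (N, 3) to deformed vertices (N, 3) and colours (N, 3)."""
    n = template_vertices.shape[0]
    # Every vertex sees the same latent codes; only its 3D position differs.
    codes = np.concatenate([shape_code, pose_code, texture_code])
    inputs = np.concatenate([template_vertices, np.tile(codes, (n, 1))], axis=1)
    hidden = np.maximum(inputs @ params["W1"] + params["b1"], 0.0)  # ReLU
    out = hidden @ params["W2"] + params["b2"]
    displacement = out[:, :3]
    colour = 1.0 / (1.0 + np.exp(-out[:, 3:]))  # squash colours into (0, 1)
    return template_vertices + displacement, colour


rng = np.random.default_rng(0)
params = init_mlp(rng, in_dim=3 + 3 * LATENT_DIM)
shape_code, pose_code, texture_code = (rng.normal(size=LATENT_DIM) for _ in range(3))

# The same network can be queried on a coarse or a fine template.
coarse_template = rng.uniform(-1.0, 1.0, (4, 3))
fine_template = rng.uniform(-1.0, 1.0, (16, 3))
coarse_vertices, coarse_colours = deform(params, coarse_template, shape_code, pose_code, texture_code)
fine_vertices, fine_colours = deform(params, fine_template, shape_code, pose_code, texture_code)
```

Because the network is evaluated per point, the output resolution is set by how densely the template is sampled rather than by the network itself, which is the property the summary describes as multi-resolution mesh output.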

Abstract

In this paper we present a high-fidelity, articulated 3D human foot model. The model is parameterised by a disentangled latent code in terms of shape, texture and articulated pose. While high-fidelity models are typically created with strong supervision such as 3D keypoint correspondences or pre-registration, we focus on the difficult case of little to no annotation. To this end, we make the following contributions: (i) we develop a Foot Implicit Neural Deformation field model, named FIND, capable of tailoring explicit meshes at any resolution, i.e. for low- or high-powered devices; (ii) we propose an approach for training our model in various modes of weak supervision, with progressively better disentanglement as more labels, such as pose categories, are provided; (iii) we introduce a novel unsupervised part-based loss for fitting our model to 2D images, which performs better than traditional photometric or silhouette losses; (iv) finally, we release a new dataset of high-resolution 3D human foot scans, Foot3D. On this dataset, we show that our model outperforms a strong PCA implementation trained on the same data in terms of shape quality and part correspondences, and that our novel unsupervised part-based loss improves inference on images.

Paper Structure (39 sections, 7 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: We produce (1) a coordinate-based multilayer perceptron (MLP) architecture, FIND, which receives a 3D vertex position on a template mesh together with shape, pose and texture embeddings, and predicts a corresponding vertex deformation and colour. (2) We train our model with 3D-based losses against ground truth scans, and learn a sensible pose space using a contrastive loss. (3) We use differentiable rendering at inference to enforce a silhouette-based loss, in addition to our novel unsupervised part-based loss.
  • Figure 2: Five feet from our dataset and corresponding pose descriptions. The pose types are based on foot articulation described in foot anatomy literature [houglum2011brunnstrom]; further details are in the supplementary material.
  • Figure 3: Our implicit surface field is an MLP which takes as input a 3D point queried on a template mesh's surface, and texture, shape and pose embeddings, and provides as output a vertex colour value and displacement.
  • Figure 4: We show the effect of slicing and masking on our dataset and our differentiable renderer.
  • Figure 5: 20 clusters on the 8th feature map of our Restyle encoder.
  • ...and 4 more figures
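
The Figure 5 caption mentions 20 clusters on an encoder feature map. As a rough illustration of how per-pixel features can be grouped into unsupervised parts, the sketch below runs plain k-means (Lloyd's algorithm) on a feature map. The choice of k-means, the function names `kmeans` and `cluster_feature_map`, and the random stand-in feature map are assumptions made for illustration; the source does not state which clustering method is used.

```python
import numpy as np


def kmeans(features, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means on an (N, C) array; returns labels (N,) and centres (k, C)."""
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct data points (fancy indexing returns a copy).
    centres = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centre: (N, k).
        dists = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # leave an empty cluster's centre unchanged
                centres[j] = members.mean(axis=0)
    return labels, centres


def cluster_feature_map(feature_map, k=20):
    """Assign each pixel of a (C, H, W) feature map to one of k clusters; returns (H, W)."""
    c, h, w = feature_map.shape
    pixels = feature_map.reshape(c, h * w).T  # one C-dimensional feature per pixel
    labels, _ = kmeans(pixels, k)
    return labels.reshape(h, w)


# Stand-in for an encoder feature map: 8 channels on a 16x16 grid (random data).
rng = np.random.default_rng(1)
feature_map = rng.normal(size=(8, 16, 16))
part_map = cluster_feature_map(feature_map, k=20)
```

Each entry of `part_map` is a cluster index for one pixel, so the result can be read as a coarse 2D part segmentation of the kind a part-based loss could compare against rendered per-vertex part probabilities.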