FIND: An Unsupervised Implicit 3D Model of Articulated Human Feet

Oliver Boyne; James Charles; Roberto Cipolla

TL;DR

This work tackles high-fidelity 3D foot reconstruction under limited supervision by introducing FIND, an implicit neural deformation field that deforms a template foot mesh using disentangled latent codes for shape, texture, and pose (each of length $100$). Because the field is evaluated per vertex on the template, it can output meshes at any resolution. For fitting to images, FIND combines differentiable rendering with an unsupervised part-based loss derived from StyleGAN2 Restyle features, which links 2D and 3D part semantics, and it imposes a 3D surface constraint on per-vertex part probabilities. Foot3D, a high-resolution textured foot dataset, supports training and evaluation. Results show that FIND outperforms a PCA-based baseline in shape quality and part correspondences, and that the unsupervised part-based loss improves image-based fitting, especially with few views. The approach has practical implications for home health, orthotics design, and online footwear retail, since accurate, interpretable foot models can be tailored to resource-constrained devices.

Abstract

In this paper we present a high-fidelity, articulated 3D human foot model. The model is parameterised by a disentangled latent code in terms of shape, texture and articulated pose. While high-fidelity models are typically created with strong supervision such as 3D keypoint correspondences or pre-registration, we focus on the difficult case of little to no annotation. To this end, we make the following contributions: (i) we develop a Foot Implicit Neural Deformation field model, named FIND, capable of tailoring explicit meshes at any resolution, i.e. for low- or high-powered devices; (ii) we present an approach for training our model in various modes of weak supervision, with progressively better disentanglement as more labels, such as pose categories, are provided; (iii) we introduce a novel unsupervised part-based loss for fitting our model to 2D images, which performs better than traditional photometric or silhouette losses; (iv) we release a new dataset of high-resolution 3D human foot scans, Foot3D. On this dataset, we show that our model outperforms a strong PCA implementation trained on the same data in terms of shape quality and part correspondences, and that our novel unsupervised part-based loss improves inference on images.

Paper Structure (39 sections, 7 equations, 9 figures, 3 tables)

Introduction
Related work
Generative shape models.
Implicit shape representations.
Unsupervised representation learning for 3D objects.
Foot reconstruction.
Method
Foot3D Dataset
The FIND model
Positional Encoding.
Network.
Heads.
Defining the surface.
Latent vectors.
Learning part-segmentation
...and 24 more sections

Figures (9)

Figure 1: (1) We produce a coordinate-based multilayer perceptron (MLP) architecture, FIND, which receives a 3D vertex position on a template mesh, together with shape, pose and texture embeddings, and predicts a corresponding vertex deformation and colour. (2) We train our model with 3D-based losses against ground truth scans, and learn a sensible pose space using a contrastive loss. (3) We use differentiable rendering at inference to enforce a silhouette-based loss, in addition to our novel unsupervised part-based loss.
Figure 2: Five feet from our dataset and corresponding pose descriptions. The pose types are based on foot articulation described in foot anatomy literature [houglum2011brunnstrom]; further details are in the supplementary.
Figure 3: Our implicit surface field is an MLP which takes as input a 3D point queried on a template mesh's surface, and texture, shape and pose embeddings, and provides as output a vertex colour value and displacement.
Figure 4: We show the effect of slicing and masking on our dataset and our differentiable renderer.
Figure 5: 20 clusters on the 8th feature map of our Restyle encoder.
...and 4 more figures
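
The captions of Figures 1 and 3 describe the field as an MLP that maps a template-surface point plus shape, pose and texture embeddings to a vertex displacement and colour. The following is a minimal, untrained sketch of that input/output contract in plain Python, not the authors' implementation. Only the latent length of 100 comes from the text above; the number of positional-encoding frequencies, the hidden width, the layer count, the sigmoid on colour, the choice to feed all three codes to a shared trunk, and every name (`ToyDeformationField`, `deform`, and so on) are assumptions made for illustration.

```python
import math
import random

LATENT_DIM = 100  # length of each latent code, as stated in the TL;DR
NUM_FREQS = 4     # assumption: number of positional-encoding frequencies
HIDDEN = 32       # assumption: hidden-layer width


def positional_encoding(point, num_freqs=NUM_FREQS):
    """Return the raw coordinates followed by sin/cos features at doubling frequencies."""
    out = list(point)
    for k in range(num_freqs):
        for x in point:
            out.append(math.sin((2 ** k) * x))
            out.append(math.cos((2 ** k) * x))
    return out


def make_layer(n_in, n_out, rng):
    """Create one dense layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, relu=False):
    """Apply a dense layer, optionally followed by ReLU."""
    weights, biases = layer
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in y] if relu else y


class ToyDeformationField:
    """Hypothetical stand-in for the field: shared trunk, two output heads."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        n_in = 3 * (1 + 2 * NUM_FREQS) + 3 * LATENT_DIM
        self.trunk = [make_layer(n_in, HIDDEN, rng), make_layer(HIDDEN, HIDDEN, rng)]
        self.disp_head = make_layer(HIDDEN, 3, rng)    # xyz displacement
        self.colour_head = make_layer(HIDDEN, 3, rng)  # RGB colour

    def __call__(self, point, shape, pose, texture):
        h = positional_encoding(point) + list(shape) + list(pose) + list(texture)
        for layer in self.trunk:
            h = dense(layer, h, relu=True)
        displacement = dense(self.disp_head, h)
        # Assumption: squash colour into (0, 1) with a sigmoid.
        colour = [1.0 / (1.0 + math.exp(-v)) for v in dense(self.colour_head, h)]
        return displacement, colour


def deform(template_vertices, field, shape, pose, texture):
    """Displace every template vertex and collect its predicted colour."""
    verts, colours = [], []
    for p in template_vertices:
        d, c = field(p, shape, pose, texture)
        verts.append([a + b for a, b in zip(p, d)])
        colours.append(c)
    return verts, colours


if __name__ == "__main__":
    field = ToyDeformationField(seed=0)
    zeros = [0.0] * LATENT_DIM
    template = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.3]]  # made-up template points
    new_verts, new_colours = deform(template, field, zeros, zeros, zeros)
    print(len(new_verts), len(new_colours))
```

Since the field is queried one point at a time, the same weights could be evaluated on a coarser or denser sampling of the template, which is one way to read the abstract's claim about tailoring meshes to any resolution.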
