DenseMarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

Dmitrii Pozdeev, Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky

TL;DR

DenseMarks introduces a dense, geometry-aware embedding for human heads that maps each image pixel to a location in a canonical 3D unit cube. A ViT-based embedder predicts these per-pixel cube locations, while semantic features are stored in a learnable latent grid $E \in \mathbb{R}^{(N_d)^3 \times D}$ that is smoothed by a 3D Gaussian and queried via trilinear interpolation (TriLerp). Training uses pairwise 2D point tracks from an off-the-shelf tracker with a contrastive loss, plus landmark anchoring and segmentation supervision to enforce structure, interpretability, and completeness. The resulting embeddings enable robust dense correspondences, improved monocular head tracking with a FLAME-based 3DMM, and applications such as dense warping and stereo reconstruction, while keeping a compact, interpretable canonical space suitable for interactive querying.
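The TriLerp query of the latent grid can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function name `trilerp`, the flat grid layout with index `(i * n + j) * n + k`, and the clamping of query points to the unit cube are assumptions made for illustration, and the 3D Gaussian smoothing of $E$ is omitted.

```python
import math


def trilerp(grid, n, point):
    """Trilinearly interpolate a feature grid at a point of the unit cube.

    grid  : flat list of n**3 feature vectors (lists of D floats); the voxel
            (i, j, k) is stored at index (i * n + j) * n + k (assumed layout).
    n     : grid resolution per axis (N_d in the text), must be >= 2.
    point : (x, y, z) with each coordinate expected in [0, 1].
    """
    # Map unit-cube coordinates to continuous grid coordinates in [0, n - 1].
    coords = [min(max(c, 0.0), 1.0) * (n - 1) for c in point]
    # Lower corner of the enclosing cell, kept inside the grid.
    lo = [min(int(math.floor(c)), n - 2) for c in coords]
    frac = [c - l for c, l in zip(coords, lo)]

    dim = len(grid[0])
    out = [0.0] * dim
    # Blend the 8 cell corners with weights that sum to 1.
    for di in (0, 1):
        wx = frac[0] if di else 1.0 - frac[0]
        for dj in (0, 1):
            wy = frac[1] if dj else 1.0 - frac[1]
            for dk in (0, 1):
                wz = frac[2] if dk else 1.0 - frac[2]
                idx = ((lo[0] + di) * n + (lo[1] + dj)) * n + (lo[2] + dk)
                weight = wx * wy * wz
                for d in range(dim):
                    out[d] += weight * grid[idx][d]
    return out


# Usage: a 3x3x3 grid whose features are the voxel coordinates themselves,
# so interpolation should return (approximately) the query point.
n = 3
grid = [[i / (n - 1), j / (n - 1), k / (n - 1)]
        for i in range(n) for j in range(n) for k in range(n)]
features = trilerp(grid, n, (0.25, 0.5, 0.9))
```

Because the interpolation weights vary continuously with the query point, nearby cube locations receive similar features, which matches the spatial continuity that the latent grid is meant to impose.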

Abstract

We propose DenseMarks, a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking-head videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
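The idea of pulling matched points together with a contrastive loss can be sketched as follows. This is an illustration only, assuming a symmetric, CLIP-style InfoNCE objective over cosine similarities. The function name `symmetric_contrastive_loss` and the temperature value of 0.07 are hypothetical choices, not details taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(feats1, feats2, temperature=0.07):
    """Contrastive loss over matched points.

    feats1[i] and feats2[i] are the semantic features sampled at the i-th
    tracked point in frame 1 and frame 2. Matched pairs are positives and
    all other pairings within the batch act as negatives.
    """
    a = [_normalize(v) for v in feats1]
    b = [_normalize(v) for v in feats2]
    # Cosine similarity between every point of frame 1 and every point of frame 2.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the frame1 -> frame2 and frame2 -> frame1 directions.
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))
```

Under this formulation the loss is close to zero when each point's feature is most similar to the feature of its tracked match, and it grows when features of unrelated points are more similar than those of the matched pair.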

Paper Structure

This paper contains 13 sections, 4 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Our method learns to embed a human head image into a semantics-aware volumetric representation based on a large collection of in-the-wild talking head videos annotated by an off-the-shelf point tracker (left). The embeddings can be estimated in a feedforward way and used for downstream applications, such as monocular tracking (right), stereo reconstruction, and many others.
  • Figure 2: To learn our representation, we train an embedder network $\phi_\theta$ in a siamese fashion. By feeding two image frames from a talking head video of the same person into the embedder independently, we obtain DenseMarks embeddings $I_C^1, I_C^2$. These embeddings correspond to canonical locations in the unit cube (DenseMarks space). This cube is discretized in advance, and a learnable matrix $E$ of latent features holds $D$-dimensional vectors that store the semantic information of each voxel grid location. To transform each of the estimated cube locations into semantic features $\textrm{Feat}^1, \textrm{Feat}^2$, we query $E$ at locations $I_C^1$, $I_C^2$ via trilinear interpolation (TriLerp). For the images $I_1, I_2$, we have a set of pair matches $K_\textrm{gt}^1, K_\textrm{gt}^2$, estimated by an off-the-shelf point tracker [karaev2024cotracker3]. We apply a contrastive loss [clip] to the semantic features of the images at these locations. This way, the cube locations corresponding to the same semantic feature are pushed closer together. Additionally, we estimate region masks $S^1, S^2$ with a semantic network $S_\xi$ and apply a segmentation loss.
  • Figure 3: Point querying. We select a specific point on a few images and, for each model, obtain the reference embedding by averaging the embeddings that the model predicts at that point. Points: red = left side of the long hair region, green = center of the right ear, orange = center of the left ear, blue = forehead center, yellow = left eyebrow corner. We indicate the embedding dimension in brackets.
  • Figure 4: Semantic regions on head images can be located via selecting corresponding volumetric regions in the canonical space. Blue: forehead center, green and orange: ears, yellow: skin near the left eyebrow corner.
  • Figure 5: Dense warping. Here, we copy pixels from source to target based on a target$\rightarrow$source nearest-neighbor search in the space of embeddings predicted by each model (even rows). For clarity, the mapping of meshgrid-like coordinates, blended with RGB, is shown additionally (odd rows). Even though deep feature extractors provide valuable matches, they either match colors rather than semantics (Sapiens [khirodkar2024sapiens], DHFeats [dhf]) or exhibit significant artifacts (DinoV3 [simeoni2025dinov3], Fit3D [yue2024improving]), and are thus less reliable for matching. Numbers in parentheses for each method correspond to the dimension of the embedding.
  • ...and 8 more figures
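
The dense warping described in Figure 5 amounts to a target-to-source nearest-neighbor lookup in embedding space. The sketch below uses a brute-force search over flattened pixel lists. The function name `dense_warp` and the squared Euclidean distance are assumptions for illustration, and a practical version would likely use a vectorized or tree-based search.

```python
def dense_warp(src_colors, src_embs, tgt_embs):
    """Copy source pixels onto the target via nearest-neighbor matching.

    src_colors : list of source pixel colors, e.g. (r, g, b) tuples.
    src_embs   : list of per-pixel embeddings for the source image.
    tgt_embs   : list of per-pixel embeddings for the target image.
    Returns one color per target pixel, taken from the source pixel whose
    embedding is closest (squared Euclidean distance, an assumed metric).
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    warped = []
    for tgt in tgt_embs:
        # Target -> source search: find the best source pixel for this target pixel.
        best = min(range(len(src_embs)), key=lambda i: sq_dist(src_embs[i], tgt))
        warped.append(src_colors[best])
    return warped
```

With embeddings that encode canonical head locations, a target pixel on the forehead would pick up the color of the source pixel nearest to the same forehead location, regardless of how similar the raw colors are.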