Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery

Siddharth Tourani, Ahmed Alwheibi, Arif Mahmood, Muhammad Haris Khan

TL;DR

This work tackles unsupervised landmark discovery for object categories by leveraging pre-trained diffusion-model representations to reveal implicit correspondences. It starts with a strong zero-shot baseline based on clustering random pixel descriptors from diffusion features and nearest-neighbor labeling, then builds D-ULD, a diffusion-based ULD method with self-training and clustering, followed by D-ULD++, which adds a pose-guided proxy task and a two-stage clustering scheme. The proposed methods achieve state-of-the-art performance across AFLW, MAFL, CatHeads, and LS3D, with substantial improvements in forward and backward NME and improved landmark consistency under diverse poses. The approach demonstrates the practical potential of diffusion-model internals for robust, unsupervised landmark discovery on challenging datasets with wide pose variation.

Abstract

Unsupervised landmark discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms, known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of diffusion models for the ULD task, we make the following core contributions. First, we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than existing ULD methods. Second, motivated by the ZeroShot performance, we develop a ULD algorithm based on diffusion features using self-training and clustering, which also outperforms prior methods by notable margins. Third, we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling, resulting in a significant performance improvement. Overall, our approach consistently outperforms state-of-the-art methods by significant margins on four challenging benchmarks: AFLW, MAFL, CatHeads and LS3D.
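The ZeroShot baseline described above (clustering descriptors at random pixel locations, then nearest-neighbour matching) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-image feature maps have already been extracted from a diffusion model and stored as an array of shape (N, H, W, C), it uses a plain NumPy k-means, and all function names (`sample_descriptors`, `kmeans`, `nearest_pixel_landmarks`) are hypothetical.

```python
import numpy as np

def sample_descriptors(feats, n_per_image, rng):
    """Pick random pixel locations in every feature map.

    feats: (N, H, W, C) array of precomputed features (assumed input).
    Returns an (N * n_per_image, C) array of pixel descriptors.
    """
    n, h, w, c = feats.shape
    ys = rng.integers(0, h, size=(n, n_per_image))
    xs = rng.integers(0, w, size=(n, n_per_image))
    idx = np.arange(n)[:, None]
    return feats[idx, ys, xs].reshape(-1, c)

def kmeans(x, k, n_iter, rng):
    """Plain Lloyd's k-means; returns the (k, C) cluster centres."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster empties
                centers[j] = members.mean(0)
    return centers

def nearest_pixel_landmarks(feat, centers):
    """For each centre, return the (row, col) of the closest pixel descriptor.

    feat: (H, W, C) feature map of one image; centers: (K, C).
    """
    h, w, c = feat.shape
    flat = feat.reshape(-1, c)
    dist = ((flat[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # (K, H*W)
    best = dist.argmin(1)
    return np.stack(np.unravel_index(best, (h, w)), axis=1)

# Toy usage with random arrays standing in for diffusion features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16, 16, 8))
centers = kmeans(sample_descriptors(feats, 50, rng), k=5, n_iter=10, rng=rng)
landmarks = nearest_pixel_landmarks(feats[0], centers)  # shape (5, 2)
```

Each cluster centre plays the role of one landmark "identity", and assigning it to its nearest pixel in every image yields one landmark location per image without any training.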

Paper Structure (16 sections, 5 equations, 12 figures, 4 tables)

This paper contains 16 sections, 5 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: (Left) Visual comparison of the proposed D-ULD++ with SOTA. (Middle) Mapping of various SOTA methods to NME space. (Right) D-ULD++ obtains the minimum errors across yaw-angle ranges on the AFLW dataset. (Awan et al. [awan2023unsupervised], Jakab et al. [jakab2018unsupervised], Mallis et al. [mallis2023keypoints], Sanchez et al. [sanchez2019object], Zhang et al. [zhang2018unsupervised]). The NME metrics are explained in the experiments section.
  • Figure 2: Proposed diffusion-based unsupervised landmark detection algorithm D-ULD++: (a) Pose-guided proxy task to reduce noisy landmarks. (b) Two-stage clustering to improve pseudo-labels. (c) Self-training using pseudo-labels.
  • Figure 3: Comparisons of ZeroShot, D-ULD and D-ULD++. (a) Visual results on exemplar images showing different colored keypoints. (b) Yaw-angle split of forward and backward errors (NME %) for the AFLW dataset. Mallis et al. [mallis2023keypoints] is shown for additional comparison.
  • Figure 4: Evaluation of the ability of raw unsupervised landmarks to capture supervised landmark locations on MAFL. Each unsupervised landmark is mapped to the best corresponding supervised landmark using the Hungarian algorithm. Accuracy is then calculated for a distance threshold of $0.2 \cdot d_{iod}$ to a landmark location, where $d_{iod}$ is the interocular distance. Accuracy is shown for each of the 68 facial landmarks, sorted in ascending order of index. Different landmark areas are highlighted with different colours and labelled as such (1-17 face contour, 18-27 eyebrows, etc.).
  • Figure 5: Cumulative Error Distribution (CED) Curves of forward and backward NME for MAFL and LS3D.
  • ...and 7 more figures
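
The evaluation described in the Figure 4 caption (one-to-one mapping of unsupervised to supervised landmarks, then accuracy at a threshold of $0.2 \cdot d_{iod}$) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code. It assumes the assignment cost is the mean normalized distance over the dataset, and it uses SciPy's `linear_sum_assignment`, which solves the same assignment problem that the Hungarian algorithm addresses. The function name `match_and_score` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_score(pred, gt, d_iod, thresh=0.2):
    """Match unsupervised landmarks to supervised ones and score them.

    pred:  (N, K, 2) unsupervised landmark locations.
    gt:    (N, M, 2) supervised landmark locations.
    d_iod: (N,) interocular distance per image.
    Returns (mapping, acc): mapping[m] is the unsupervised index assigned
    to supervised landmark m; acc[m] is the fraction of images in which
    that landmark lies within thresh * d_iod (NaN if m is unmatched).
    """
    # Pairwise distances, normalized per image by the interocular distance.
    dist = np.linalg.norm(pred[:, :, None, :] - gt[:, None, :, :], axis=-1)
    norm = dist / d_iod[:, None, None]            # (N, K, M)
    # Assumption: the assignment cost is the dataset-mean normalized distance.
    rows, cols = linear_sum_assignment(norm.mean(axis=0))
    acc = np.full(gt.shape[1], np.nan)
    for k, m in zip(rows, cols):
        acc[m] = (norm[:, k, m] <= thresh).mean()
    return dict(zip(cols.tolist(), rows.tolist())), acc

# Toy usage: predictions are a shuffled, slightly noisy copy of the ground truth.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 100, size=(20, 4, 2))
perm = [2, 0, 3, 1]                               # pred k copies gt perm[k]
pred = gt[:, perm] + rng.normal(scale=0.1, size=gt.shape)
mapping, acc = match_and_score(pred, gt, d_iod=np.full(20, 10.0))
```

Plotting `acc` against the supervised landmark index gives a per-landmark accuracy curve of the kind the caption describes.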