Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum

Thomas M Metz, Matthew Q Hill, Alice J O'Toole

TL;DR

This work addresses the challenge of unifying object, face (high- and low-quality), and whole-body recognition in a single embedding space while limiting catastrophic forgetting. It introduces the Interleaved Multi-Domain Identity Curriculum (IMIC) in two variants, IMIC-A and IMIC-B, and evaluates it by fine-tuning three foundation models (EVA-02, CLIP, DINOv3) to perform all four tasks in one space. With IMIC, EVA-02 and CLIP exceed human multi-tasking accuracy across face, body, and object datasets, perform comparably with domain-expert models, and largely preserve out-of-distribution generalization. The four tasks are linearly separable in the embedding space yet share features substantially: fewer than 100 principal components (PCs) computed from any one task support the other tasks with nearly zero performance degradation. These findings suggest that interleaved curricula can yield unified, robust representations for diverse recognition tasks, which challenges the assumption that strict domain specialization is required and offers practical benefits for multi-task vision systems.

Abstract

Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space -- without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.
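
The cross-task PC result in the abstract can be pictured with a small NumPy sketch: fit principal components on the embeddings of one task, then project the embeddings of another task onto the top k of them before evaluating. This is only an illustration under stated assumptions, not the authors' analysis code. The function names `fit_pcs` and `project`, the embedding width of 256, and the random placeholder arrays are all hypothetical.

```python
import numpy as np

def fit_pcs(embeddings: np.ndarray, k: int):
    """Return the mean and top-k principal directions of one task's embeddings."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:k]

def project(embeddings: np.ndarray, mean: np.ndarray, components: np.ndarray):
    """Project embeddings onto a PC subspace fitted on (possibly) another task."""
    return (embeddings - mean) @ components.T

# Placeholder data standing in for embeddings from two tasks (assumption).
rng = np.random.default_rng(0)
task_a = rng.normal(size=(500, 256))  # e.g. embeddings from one task
task_b = rng.normal(size=(200, 256))  # e.g. embeddings from a different task

mean, pcs = fit_pcs(task_a, k=100)
reduced_b = project(task_b, mean, pcs)  # task B expressed in task A's top-100 PCs
```

Under this reading, the second task would be scored on `reduced_b` with its usual metric and compared against the score obtained from the full embeddings.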

Paper Structure

This paper contains 29 sections, 8 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: In the IMIC-A training procedure, the batch loss is aggregated as the sum of task losses. All tasks share a common loss function (Triplet Loss in our experiments). Online accuracy on prior training samples for each task informs the frequency at which that task is sampled in future training steps.
  • Figure 2: The Human Multi-task Index compares model multi-tasking performance with human multi-tasking performance (left). The Expert Multi-task Index compares it with the expert domain models (right). IMIC models are the first models to surpass humans on face, body, and object classification concurrently. These models approach parity with domain experts.
  • Figure 3: Interleaved training can offset catastrophic forgetting. Interleaved models exhibit substantially smaller performance drops on out-of-distribution (OOD) object classification data than a model trained with body data. Two of the models even improved at object classification with interleaved multi-task training.
  • Figure 4: Task separation accuracy as a function of PCs isolating subspaces with a sliding window that moves from early to later PCs. Early PCs are highly effective at task separation.
  • Figure 5: $t$-SNE visualization of raw embeddings generated from BTS (body identification), ImageNet (object classification), LFW (HQ face identification), and TinyFace (LQ face identification). Left: PCs 1--800. Right: PCs 4--800. Perplexity = 30 in both plots. Perplexity values ranging from 10 to 1,000 showed no difference in overall task separability.
  • ...and 2 more figures
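
The IMIC-A procedure described in the Figure 1 caption (batch loss as the sum of per-task triplet losses, with task sampling frequency informed by online accuracy) can be sketched roughly as follows. The caption does not give the exact sampling rule, so the rule here is an assumption: tasks with lower online accuracy are sampled more often, with a small floor so that no task is dropped. The margin value, the `floor` parameter, and all function names are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on Euclidean distances, averaged over the batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

def batch_loss(task_triplets, margin=0.2):
    """Sum the per-task losses; every task shares the same loss function."""
    return sum(triplet_loss(a, p, n, margin) for a, p, n in task_triplets.values())

def sampling_probs(online_acc, floor=0.05):
    """Assumed rule: weight each task by (1 - online accuracy) plus a floor."""
    tasks = list(online_acc)
    weights = np.array([1.0 - online_acc[t] + floor for t in tasks])
    return dict(zip(tasks, weights / weights.sum()))

# Hypothetical online accuracies for the four tasks.
probs = sampling_probs({"object": 0.90, "face_hq": 0.95, "face_lq": 0.60, "body": 0.70})

# Degenerate triplets (anchor == positive == negative) give a loss equal to the margin.
x = np.zeros((4, 8))
total = batch_loss({"face_hq": (x, x, x), "body": (x, x, x)})
```

In a real training loop the summed loss would be backpropagated through the shared backbone, which is what couples the gradients of the tasks; that step is omitted here to keep the sketch framework-independent.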