Face, Whole-Person, and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum

Thomas M Metz; Matthew Q Hill; Alice J O'Toole

TL;DR

This work addresses the challenge of unifying object recognition, face recognition from high-quality (HQ) and low-quality (LQ) images, and whole-body person recognition in a single embedding space while avoiding substantial catastrophic forgetting. It introduces the Interleaved Multi-Domain Identity Curriculum (IMIC), in two variants (IMIC-A and IMIC-B), and uses it to fine-tune three foundation models (EVA-02, CLIP, DINOv3) on all four tasks at once. EVA-02 and CLIP with IMIC perform comparably with domain experts on all four tasks concurrently, are more accurate than humans at multitasking across face, body, and object datasets, and show no substantial loss of out-of-distribution generalization. In the most accurate variants (EVA-02 with IMIC-A and IMIC-B), the four tasks are linearly separable in the unified embedding space yet share features substantially: fewer than 100 principal components (PCs) computed from any one task can perform all other tasks with nearly zero performance degradation. These findings suggest that well-designed interleaved curricula can yield unified, robust representations for diverse recognition tasks, challenging the notion of strict domain specialization and offering practical benefits for multi-task vision systems.

Abstract

Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space -- without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.
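The cross-task principal-component result can be made concrete with a small sketch. The code below is not the authors' analysis pipeline: it uses synthetic embeddings, a hypothetical nearest-centroid evaluation, and illustrative function names (`top_k_components`, `project`, `nearest_centroid_accuracy`) that do not come from the paper. It only shows the general idea of fitting PCs on one task's embeddings, projecting a second task's embeddings onto them, and comparing accuracy before and after projection.

```python
import numpy as np

def top_k_components(embeddings, k):
    """Return the top-k principal directions (k, d) and the mean of `embeddings` (n, d)."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k], mean

def project(embeddings, components, mean):
    """Express embeddings in the coordinates of the given principal directions."""
    return (embeddings - mean) @ components.T

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Classify each test point by its closest class centroid (assumed evaluation)."""
    labels = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in labels])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predictions = labels[dists.argmin(axis=1)]
    return float((predictions == test_y).mean())

# Synthetic stand-ins for two tasks whose embeddings share feature directions.
rng = np.random.default_rng(0)
d, latent = 64, 8
basis = rng.normal(size=(latent, d))               # shared directions (assumption)
task_a = rng.normal(size=(200, latent)) @ basis    # e.g. "object" embeddings
class_means = rng.normal(scale=3.0, size=(5, latent))
labels_b = np.repeat(np.arange(5), 40)             # 5 identities, 40 samples each
task_b = (class_means[labels_b] + rng.normal(size=(200, latent))) @ basis

# Fit PCs on task A only, then project task B onto that subspace.
components, mean_a = top_k_components(task_a, k=latent)
task_b_sub = project(task_b, components, mean_a)

# Compare task-B accuracy in the full space and in task A's PC subspace.
train = np.arange(200) % 2 == 0
test = ~train
acc_full = nearest_centroid_accuracy(
    task_b[train], labels_b[train], task_b[test], labels_b[test])
acc_sub = nearest_centroid_accuracy(
    task_b_sub[train], labels_b[train], task_b_sub[test], labels_b[test])
print("full space:", acc_full, "| task-A PC subspace:", acc_sub)
```

In this toy setup the two tasks share a subspace by construction, so projecting onto task A's components loses nothing about task B. With real model embeddings, the size of any accuracy gap as `k` varies is the quantity of interest.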
