Omni-ID: Holistic Identity Representation Designed for Generative Tasks

Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman

TL;DR

Omni-ID presents a fixed-size, structured facial identity encoding designed for generative tasks by aggregating multiple images of an individual into a single representation. It combines a transformer-based Omni-ID Encoder with a few-to-many identity reconstruction paradigm and a dual-decoder setup (a Masked Transformer Decoder and a Flow Matching Decoder) to capture holistic identity features across poses and expressions. Trained on the MFHQ dataset, Omni-ID demonstrates superior identity fidelity in controllable face generation and personalized text-to-image generation compared with discriminative baselines like ArcFace and CLIP. The approach offers scalable identity-preserving generation and opens avenues for richer, subject-specific synthesis, while highlighting areas for extension to non-facial attributes and broader dataset enhancements.

Abstract

We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varying number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach uses a few-to-many identity reconstruction training paradigm, where a limited set of input images is used to reconstruct multiple target images of the same individual in various poses and expressions. A multi-decoder framework is further employed to leverage the complementary strengths of diverse decoders during training. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset, a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.
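
The few-to-many paradigm described in the abstract can be illustrated with a small sampling sketch. This is not the authors' data pipeline: the function name `sample_few_to_many`, the dictionary layout, and the choice to keep the input and target sets disjoint are all assumptions made for illustration.

```python
import random


def sample_few_to_many(images_by_identity, num_inputs, num_targets, rng):
    """Pick one identity, then draw a small input set and a larger target set.

    Assumption: inputs and targets are disjoint images of the same person.
    """
    # Only identities with enough images can supply both sets.
    eligible = [
        pid for pid, imgs in images_by_identity.items()
        if len(imgs) >= num_inputs + num_targets
    ]
    if not eligible:
        raise ValueError("no identity has enough images")
    identity = rng.choice(sorted(eligible))
    picked = rng.sample(images_by_identity[identity], num_inputs + num_targets)
    # The first few images feed the encoder; the rest are reconstruction targets.
    return identity, picked[:num_inputs], picked[num_inputs:]


# Hypothetical file names standing in for a multi-view face collection.
dataset = {
    "person_a": [f"a_{i}.jpg" for i in range(8)],
    "person_b": [f"b_{i}.jpg" for i in range(3)],  # too few images, skipped
}
identity, inputs, targets = sample_few_to_many(dataset, 2, 5, random.Random(0))
print(identity, len(inputs), len(targets))  # person_a 2 5
```

In this toy setup two input images must explain five other views of the same person, which is the pressure that pushes a representation toward pose- and expression-independent identity features.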

Paper Structure

This paper contains 22 sections, 4 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Omni-ID is a facial representation that consolidates information from a varied number of images of an individual into a fixed-size, structured encoding. Each element of this encoding captures specific global or local identity features, enabling high-fidelity generation in new poses and expressions while capturing identity-consistent variations.
  • Figure 2: Face generation comparison of different facial representations with a single input (top row) and two inputs (bottom row). We evaluate different facial representations by training an IP-Adapter [ye2023ip-adapter] on FLUX [FLUX] with each representation. Single-instance representations such as ArcFace and CLIP struggle to combine the unique features that appear in each observation (e.g., eye color and nose shape), whereas our Omni-ID, designed with a few-to-many generative objective, improves identity representation with each additional view, unifying unique attributes from multiple views into a single representation.
  • Figure 3: Omni-ID employs a multi-decoder few-to-many identity reconstruction training strategy, incorporating three key design features: (1) An encoder that learns a unified, fixed-size identity representation from a varied number of inputs; (2) A few-to-many identity reconstruction task, designed to generate multiple faces of an individual in various poses and expressions from a limited set of samples of the same individual; (3) A multi-decoder training strategy that combines the unique strengths of various decoders while mitigating the limitations of any single decoder.
  • Figure 4: The Omni-ID Encoder receives a set of images of an individual and projects them into keys and values, which are then fed into cross-attention layers. These layers attend to learnable queries (see the attention figure) that are semantic-aware, allowing the encoder to capture shared identity features across images. Self-attention layers refine these interactions further, producing a holistic representation $\ell$.
  • Figure 5: Multi-decoder training. (left) The Masked Transformer Decoder (MTD) is designed to reconstruct unseen facial pixels from the Omni-ID representation and a minimal subset of visible pixels that does not leak identity. (right) The Flow Matching Decoder enhances the encoder through a higher-quality reconstruction task.
  • ...and 9 more figures
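
The encoder described in the Figure 4 caption turns any number of input images into a fixed-size output. A minimal sketch of that property follows, assuming the set the cross-attention layers attend to is a fixed bank of queries (the caption's reference is garbled, so this is an assumption). Random numbers stand in for image tokens and learned weights, the self-attention refinement is omitted, and every name and dimension (`cross_attend`, `encode`, `DIM`, `NUM_QUERIES`, `TOKENS_PER_IMAGE`) is hypothetical rather than taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens, w_key, w_value):
    """Each query reads from all image tokens; output has one row per query."""
    keys = matmul(tokens, w_key)
    values = matmul(tokens, w_value)
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, values)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, NUM_QUERIES, TOKENS_PER_IMAGE = 8, 4, 6  # illustrative sizes only
queries = random_matrix(NUM_QUERIES, DIM, rng)  # stand-in for learned queries
w_key = random_matrix(DIM, DIM, rng)
w_value = random_matrix(DIM, DIM, rng)


def encode(num_images):
    # Tokens from all images are pooled into one set before attention.
    tokens = random_matrix(num_images * TOKENS_PER_IMAGE, DIM, rng)
    return cross_attend(queries, tokens, w_key, w_value)


for n in (1, 3, 5):
    rep = encode(n)
    print(n, len(rep), len(rep[0]))  # always 4 rows of 8 values
```

The output shape depends only on the number of queries, not on how many images were supplied, which is one way a "fixed-size representation from a varied number of inputs" can be obtained.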