Personalized Representation from Personalized Generation

Shobhita Sundaram, Julia Chae, Yonglong Tian, Sara Beery, Phillip Isola

TL;DR

The paper addresses learning personalized visual representations from only a few real images by leveraging synthetic data generated through diffusion-based personalization. It proposes a three-stage pipeline that personalizes a generator (via DreamBooth), synthesizes diverse target-specific images, and fine-tunes a general-purpose encoder with a contrastive objective. The resulting representations are evaluated on recognition, retrieval, detection, and segmentation. Empirical results across DF2, Dogs, and PODS show consistent gains over pretrained representations, and the data-generation strategy and prompt design (including classifier-free guidance (CFG) and LLM-generated captions) significantly affect performance. The work contributes a new dataset (PODS), reformulations of existing datasets, and practical insights for data-efficient, private personalization, including an integration path with PerSAM for dense tasks.

Abstract

Modern vision models excel at general-purpose downstream tasks. It is unclear, however, how they may be used for personalized vision tasks, which are both fine-grained and data-scarce. Recent works have successfully applied synthetic data to general-purpose representation learning, while advances in T2I diffusion models have enabled the generation of personalized images from just a few real examples. Here, we explore a potential connection between these ideas, and formalize the challenge of using personalized synthetic data to learn personalized representations, which encode knowledge about an object of interest and may be flexibly applied to any downstream task relating to the target object. We introduce an evaluation suite for this challenge, including reformulations of two existing datasets and a novel dataset explicitly constructed for this purpose, and propose a contrastive learning approach that makes creative use of image generators. We show that our method improves personalized representation learning for diverse downstream tasks, from recognition to segmentation, and analyze characteristics of image generation approaches that are key to this gain.
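
To make the contrastive objective mentioned above more concrete, here is a minimal sketch of an InfoNCE-style loss for one anchor embedding. This is an assumption for illustration: the summary does not state the exact loss, and the function name `info_nce`, the temperature value, and the use of plain Python lists as embeddings are all hypothetical. In the paper's setting, the positive would come from a generated image of the target instance and the negatives from other images.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (illustrative stand-in, not the paper's exact loss).

    anchor, positive: embedding vectors of the same target instance.
    negatives: list of embedding vectors of other images.
    temperature: hypothetical value chosen for this sketch.
    """
    pos_logit = _cosine(anchor, positive) / temperature
    logits = [pos_logit] + [_cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Equivalent to -log(exp(pos) / sum(exp(all logits))).
    return log_denominator - pos_logit
```

The loss is small when the anchor is closer to its positive than to any negative, and grows when a negative is more similar than the positive. That is the pressure that would pull embeddings of the same instance together during fine-tuning.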

Paper Structure

This paper contains 67 sections, 2 equations, 29 figures, 9 tables.

Figures (29)

  • Figure 1: Learning personalized representations from limited real data. In this paper we explore whether and how synthetic data can be used to train a personalized representation. Given a few real images of an instance, we generate novel images and contrastively fine-tune a general-purpose pretrained model to learn a personalized representation, useful for diverse downstream tasks.
  • Figure 2: Personalized Representation Training Pipeline. Our three-stage training method: 1) Generative Model Training, 2) Synthetic Data Generation, and 3) Contrastive LoRA Fine-Tuning.
  • Figure 3: (left) Examples of instances from our new PODS dataset. We showcase one example instance from each of the five object categories, displaying images from both the training and various test splits. We dim the surrounding scene, highlighting the instance of interest. This masking technique is not applied to our dataset images or during training. (right) We show example generated images from DreamBooth (LLM, CFG 5), which we use as positives in our representation-learning fine-tuning.
  • Figure 4: Inference Pipelines. We visualize the global (classification, retrieval) and local (detection, segmentation) evaluation pipelines. Global inference uses cosine similarity between CLS embeddings, while local inference extracts patch features with spatial information.
  • Figure 5: Qualitative Results. Each triplet shows the test image (left) and dense prediction maps for pretrained DINOv2 (center) and the personalized model (right). Prediction maps are computed via patchwise embedding similarity between the test and localized train images, following Figure 4. Personalized representations distinctly localize the target instance, unlike pretrained embeddings. For visualization only, the personalized instance is highlighted in the test images; this highlighting is not applied during training or inference.
  • ...and 24 more figures
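
The two inference modes described in the captions for Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are plain Python lists, the function names are hypothetical, and taking the maximum over train embeddings is an aggregation choice made for this sketch. The captions only specify cosine similarity between CLS embeddings (global) and patchwise embedding similarity (local).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def global_score(test_cls, train_cls_list):
    """Global inference: score a test image by its best CLS match among train images.

    The max aggregation is an assumption of this sketch.
    """
    return max(cosine_similarity(test_cls, t) for t in train_cls_list)


def patch_similarity_map(test_patches, train_patches):
    """Local inference: dense map of patchwise similarity.

    test_patches: 2D grid (list of rows) of patch embedding vectors.
    train_patches: flat list of patch embeddings taken from the localized
        instance in the train images.
    Returns a grid with the same layout, holding each test patch's best
    similarity to any train patch (max aggregation is an assumption).
    """
    return [
        [max(cosine_similarity(p, q) for q in train_patches) for p in row]
        for row in test_patches
    ]
```

A global score can be thresholded or ranked for classification and retrieval, while the dense map keeps spatial layout, so it can serve as a starting point for detection or segmentation.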