Instance-Level Generation for Representation Learning

Yankun Wu, Zakaria Laskar, Giorgos Kordopatis-Zilos, Noa Garcia, Giorgos Tolias

TL;DR

This work tackles the data bottleneck in instance-level recognition by introducing ILGen, a fully synthetic pipeline that uses an LLM to generate object categories and a generative diffusion model to create diverse object instances, backgrounds, and viewpoints. By training a foundation vision encoder with a retrieval-oriented objective (recall@k) on CKN synthetic data, the method achieves cross-domain ILR improvements across seven benchmarks and demonstrates a new paradigm where only domain names are required as input. The results show synthetic data can outperform real-labeled data in multi-domain retrieval tasks, highlighting the practicality of synthetic ILR for rapid domain adaptation and wide applicability. The approach integrates LLMs, GDMs, and advanced background relighting to produce high-variance, instance-level training sets that improve universal representation learning for ILR.

Abstract

Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR's real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.

Paper Structure

This paper contains 51 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Examples of images generated for learning instance-level representations. Given an object generated by a generative diffusion model (column 1), the foreground is segmented (column 2) and different background variations are added (columns 3 & 4), producing images of the same instance under diverse conditions.
  • Figure 2: Overview of instance-level training data generation. A domain name or description is the only input, which is used to prompt an LLM to provide a list of object category names. Then, we generate examples of those categories using a GDM, remove the background, and synthesize lighting and background multiple times per generated example to create a diverse set of positive images for each instance.
  • Figure 3: Examples of object instances generated by GDM for specific categories. We show the category name, the generated image, and the background removal result, both with "in a clean background" (columns 1 & 2) and without it (columns 3 & 4).
  • Figure 4: Examples of object instances generated by GDM (column 1), and the generated images that leave the object intact and add lighting and background that is well suited to the object (columns 2 to 4).
  • Figure 5: Training batch construction for instance-level representation learning. A batch simulates a retrieval task with a query (blue) and database of positive (green) and negative (red) images. Images are considered positive if they belong to the same class, otherwise they are negatives. An image encoder is trained with metric learning on this batch.
  • ...and 4 more figures
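
The batch layout described in the Figure 5 caption can be illustrated with a small sketch. It is not the authors' training code: the function names `positive_mask` and `recall_at_k`, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration. It computes the plain (non-differentiable) recall@k for a single query, whereas training would need a differentiable surrogate of this quantity.

```python
import numpy as np


def positive_mask(labels, query_index):
    """Mark batch images sharing the query's label as positives (hypothetical helper)."""
    labels = np.asarray(labels)
    mask = labels == labels[query_index]
    mask[query_index] = False  # the query is not part of its own database
    return mask


def recall_at_k(embeddings, labels, query_index, k):
    """Return 1.0 if any positive is among the k nearest images to the query, else 0.0.

    Cosine similarity is an assumption; the caption does not name the metric.
    """
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise rows
    sims = emb @ emb[query_index]          # similarity of every image to the query
    sims[query_index] = -np.inf            # exclude the query from the ranking
    top_k = np.argsort(-sims)[:k]          # indices of the k most similar images
    return float(positive_mask(labels, query_index)[top_k].any())


# Toy batch: four images, two instances (labels 0 and 1).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
print(recall_at_k(toy_embeddings, toy_labels, query_index=0, k=1))
```

In this toy batch the image nearest to query 0 shares its label, so recall@1 is 1.0; relabelling the batch as `[0, 1, 0, 1]` makes the nearest neighbour a negative, and recall@1 drops to 0.0.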