
SynCDR: Training Cross Domain Retrieval Models with Synthetic Data

Samarth Mishra, Carlos D. Castillo, Hongcheng Wang, Kate Saenko, Venkatesh Saligrama

TL;DR

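This work addresses cross-domain retrieval when the two domains may not share any common categories in the training data. It proposes generating synthetic data to fill in the missing category examples across domains, via category-preserving translation of images from one visual domain to another. It finds that prompting large-scale pre-trained text-to-image diffusion models yields better replacement synthetic data than translation methods trained specifically for a pair of domains, leading to more accurate cross-domain retrieval models.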

Abstract

In cross-domain retrieval, a model is required to identify images from the same semantic category across two visual domains. For instance, given a sketch of an object, a model needs to retrieve a real image of it from an online store's catalog. A standard approach for such a problem is learning a feature space of images where Euclidean distances reflect similarity. Even without human annotations, which may be expensive to acquire, prior methods function reasonably well using unlabeled images for training. Our problem constraint takes this further to scenarios where the two domains do not necessarily share any common categories in training data. This can occur when the two domains in question come from different versions of some biometric sensor recording identities of different people. We posit a simple solution, which is to generate synthetic data to fill in these missing category examples across domains. We do this via category-preserving translation of images from one visual domain to another. We compare approaches specifically trained for this translation for a pair of domains, as well as those that can use large-scale pre-trained text-to-image diffusion models via prompts, and find that the latter can generate better replacement synthetic data, leading to more accurate cross-domain retrieval models. Our best SynCDR model can outperform prior art by up to 15%. Code for our work is available at https://github.com/samarth4149/SynCDR.
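
The retrieval step described above, ranking target-domain images by Euclidean distance to a query in a learned feature space, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `retrieve` and `precision_at_k`, the toy 2-D features, and the choice of precision@k as the score are assumptions made for illustration. In practice the features would come from a trained encoder.

```python
import numpy as np

def retrieve(query_feats, gallery_feats, k=5):
    """Return, for each query, the indices of its k nearest gallery features."""
    # Squared Euclidean distance expanded as ||q||^2 - 2 q.g + ||g||^2.
    q_sq = np.sum(query_feats ** 2, axis=1, keepdims=True)      # shape (Q, 1)
    g_sq = np.sum(gallery_feats ** 2, axis=1)[None, :]          # shape (1, G)
    dists = q_sq - 2.0 * query_feats @ gallery_feats.T + g_sq   # shape (Q, G)
    # Sorting by squared distance gives the same ranking as true distance.
    return np.argsort(dists, axis=1)[:, :k]

def precision_at_k(ranked, query_labels, gallery_labels):
    """Fraction of retrieved items that share the query's category."""
    hits = gallery_labels[ranked] == query_labels[:, None]
    return float(hits.mean())

# Toy stand-in for encoder outputs (hypothetical data): two well-separated
# categories, with queries from one "domain" and the gallery from another.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
gallery_labels = np.array([0, 0, 0, 1, 1, 1])
gallery_feats = centers[gallery_labels] + 0.1 * rng.standard_normal((6, 2))
query_labels = np.array([0, 1])
query_feats = centers[query_labels] + 0.1 * rng.standard_normal((2, 2))

ranked = retrieve(query_feats, gallery_feats, k=3)
score = precision_at_k(ranked, query_labels, gallery_labels)
print(ranked, score)
```

Labels are used here only to score the ranking; the retrieval itself needs nothing but the features, which is consistent with the unlabeled training setting the abstract describes.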

Paper Structure

This paper contains 23 sections, 3 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Motivation. Cross-domain retrieval problems show up in many applications, and prior work has developed solutions for different scenarios, including when labeled data is absent [kim2021cds]. These solutions, however, rely on the presence of same-category data in both domains so that similar pairs can be discovered. When such data is missing, these approaches can fail. Our solution is to make up for the missing examples using synthetic data (generated via label-preserving translation). While such data may not be a perfect replacement (e.g. the real image generated from the painting in the example shown may not be entirely realistic because of its white background), we show that it is still useful for training cross-domain retrieval models.
  • Figure 2: Synthetic examples from different translation methods. We compare synthetic data generated using 4 methods. Since ELITE is not specifically restricted to closely mimicking the original image, it can generate more natural examples of target domain data, which serve as better synthetic replacements for missing real data. For more discussion, refer to the subsection on synthetic data methods (subsec:syn_methods).
  • Figure 3: ELITE-generated examples with and without textual inversion. When generating paintings or cliparts from sketches, we found that using a textual inversion token encoding the domain properties leads to poorer category retention in the generated image, and hence to poorer performance in general. Textual inversion can, however, be useful in scenarios where the domain cannot be textually described.
  • Figure 4: Feature visualization before and after training. The examples (from the test set of DomainNet Clipart and Painting) become better clustered and more aligned across domains after training.
  • Figure 5: Different Edit Strengths for translation using Img2Img in the CUB dataset. We see that higher edit strengths can make the output resemble paintings more, but they can drastically change the image contents such that the category is not preserved.
  • ...and 4 more figures