Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

Ryan Ramos, Vladan Stojnić, Giorgos Kordopatis-Zilos, Yuta Nakashima, Giorgos Tolias, Noa Garcia

Abstract

Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. Unseen alterations introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positive or negative, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
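
The claim that acquisition and processing parameters "can be easily recovered" is the kind of statement usually checked with a linear probe on frozen embeddings, which is also the setup the Figure 4 caption describes. The following is a minimal sketch of that idea, not the authors' pipeline: the embeddings are synthetic stand-ins for real encoder outputs, and the function names, the embedding dimension, the planted trace direction, and the least-squares probe are all illustrative assumptions.

```python
import numpy as np


def make_embeddings(n, dim, trace_strength, rng):
    """Synthetic stand-in for frozen encoder outputs (hypothetical data).

    Each embedding is Gaussian noise plus a shift along one fixed direction
    whose sign depends on a binary metadata label (for example, low vs. high
    JPEG quality). trace_strength controls how strongly the label is encoded.
    """
    labels = rng.integers(0, 2, size=n)
    direction = np.zeros(dim)
    direction[0] = 1.0
    signs = 2.0 * labels - 1.0  # map {0, 1} to {-1, +1}
    x = rng.normal(size=(n, dim)) + trace_strength * signs[:, None] * direction
    return x, labels


def fit_linear_probe(x, labels):
    """Fit a least-squares linear classifier on frozen features (with bias)."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    targets = 2.0 * labels - 1.0
    w, *_ = np.linalg.lstsq(xb, targets, rcond=None)
    return w


def probe_accuracy(w, x, labels):
    """Fraction of samples whose metadata label the probe predicts correctly."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    preds = (xb @ w > 0).astype(int)
    return float((preds == labels).mean())


rng = np.random.default_rng(0)
x_train, y_train = make_embeddings(2000, 64, trace_strength=2.0, rng=rng)
x_test, y_test = make_embeddings(1000, 64, trace_strength=2.0, rng=rng)

w = fit_linear_probe(x_train, y_train)
acc = probe_accuracy(w, x_test, y_test)
print(f"metadata probe accuracy: {acc:.3f}")
```

Setting `trace_strength=0.0` removes the planted signal, and the probe should then fall to roughly chance level. With real data, the synthetic generator would be replaced by embeddings extracted from a frozen encoder and labels derived from the applied processing or the recorded camera metadata.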

Paper Structure

This paper contains 36 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Impact of metadata on the representation space of visual encoders. The similarity in the representation space of foundation visual encoders is influenced not only by semantic labels but also by metadata labels. This work explores and uncovers these effects for metadata related to image processing (e.g., JPEG compression) and image acquisition (e.g., camera model). The figure illustrates the search results for the query image with specific metadata labels. Retrieved images are ranked according to their similarity to the query. Different image collections exhibit varied combinations of semantic and metadata labels, affecting the retrieval outcome.
  • Figure 2: Distribution of similarities with respect to a query image. Four distributions are shown, based on whether the semantic and metadata labels of the images match those of the query. Metadata labels are based on JPEG quality. The results highlight that similarity is influenced by both types of labels.
  • Figure 3: Examples of images in the PairCams dataset. Each pair depicts the same object and/or scene, captured with two different camera types. In each pair, the left image was taken with a non-smartphone camera and the right one with a smartphone.
  • Figure 4: Image processing-based label prediction. Classification accuracy using a linear classifier on embeddings of different frozen visual encoders on ImageNet (top) and iNaturalist (bottom) datasets. Ordering follows the models table (tab:models) in the supplementary material.
  • Figure 5: ImageNet validation accuracy at different masking ratios. A masking ratio of 95% is enough to prevent successful semantic label prediction, which reduces the leakage of semantic cues into the acquisition-related label prediction task. Visual examples of different masking ratios are shown at the bottom.
  • ...and 8 more figures
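
The four-way split described in the Figure 2 caption (whether the semantic label and the metadata label each match those of the query) can be sketched as follows. This is an illustrative toy, not the paper's code: the function `group_similarities`, the orthogonal label prototypes, the 1.0 and 0.5 mixing weights, and the noise level are assumptions chosen only to make the effect visible.

```python
import numpy as np


def group_similarities(query, gallery, q_sem, q_meta, sem, meta):
    """Split cosine similarities to the query into four groups.

    Keys are (semantic_match, metadata_match) booleans; values are arrays of
    similarities for gallery items falling in that group.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    groups = {}
    for sem_match in (True, False):
        for meta_match in (True, False):
            mask = ((sem == q_sem) == sem_match) & ((meta == q_meta) == meta_match)
            groups[(sem_match, meta_match)] = sims[mask]
    return groups


rng = np.random.default_rng(0)
dim, n = 32, 400
basis = np.eye(dim)
sem_protos = basis[0:2]   # hypothetical semantic class directions
meta_protos = basis[2:4]  # hypothetical metadata (e.g. JPEG quality) directions

sem = rng.integers(0, 2, size=n)
meta = rng.integers(0, 2, size=n)
# Toy embeddings: strong semantic component, weaker metadata component, noise.
gallery = sem_protos[sem] + 0.5 * meta_protos[meta] + 0.05 * rng.normal(size=(n, dim))
query = sem_protos[0] + 0.5 * meta_protos[0]

groups = group_similarities(query, gallery, q_sem=0, q_meta=0, sem=sem, meta=meta)
for (sem_match, meta_match), vals in groups.items():
    print(f"semantic match={sem_match}, metadata match={meta_match}: "
          f"mean similarity {vals.mean():.3f} over {len(vals)} items")
```

In this toy setup, items that share both labels with the query score highest, and a metadata match raises similarity even when the semantic label differs, which mirrors the qualitative point of the caption that similarity is influenced by both types of labels.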