Image Similarity using An Ensemble of Context-Sensitive Models

Zukang Liao, Min Chen

TL;DR

This work uses an ensemble model to address two challenges: sparse sampling in the image space (R, A, B) and biases in models trained with context-based data. It demonstrates that context-based labelling and model training can be effective when an appropriate ensemble approach is used to alleviate the limitation caused by sparse sampling.

Abstract

Image similarity has been extensively studied in computer vision. In recent years, machine-learned models have shown their ability to encode more semantics than traditional multivariate metrics. However, when labelling semantic similarity, assigning a numerical score to a pair of images is impractical, which makes it difficult to improve and compare models on this task. In this work, we present a more intuitive approach to building and comparing image similarity models based on labelled data in the form of A:R vs B:R, i.e., determining whether an image A is closer to a reference image R than another image B is. We address the challenges of sparse sampling in the image space (R, A, B) and biases in models trained with context-based data by using an ensemble model. Our testing results show that the constructed ensemble model performs ~5% better than the best individual context-sensitive model. It also performs better than models that were directly fine-tuned using mixed imagery data, as well as existing deep embeddings, e.g., CLIP and DINO. This work demonstrates that context-based labelling and model training can be effective when an appropriate ensemble approach is used to alleviate the limitation caused by sparse sampling.
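To make the A:R vs B:R labelling format concrete, the following is a minimal sketch of how an embedding-based similarity model could be scored against such labels. It is not the authors' implementation: the function names, the use of cosine similarity, and the label convention (1 when A is annotated as closer to R, 0 otherwise) are assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Row-wise cosine similarity between two (n, d) arrays."""
    u_norm = u / np.linalg.norm(u, axis=1, keepdims=True)
    v_norm = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(u_norm * v_norm, axis=1)


def predict_closer(emb_r, emb_a, emb_b):
    """Return 1 where A is predicted closer to R than B is, else 0.

    Each argument is an (n, d) array of embeddings, one row per triple.
    """
    sim_a = cosine_similarity(emb_r, emb_a)
    sim_b = cosine_similarity(emb_r, emb_b)
    return (sim_a > sim_b).astype(int)


def triple_accuracy(emb_r, emb_a, emb_b, labels):
    """Fraction of triples where the prediction matches the annotation.

    Assumed label convention: 1 means A was annotated as closer to R.
    """
    preds = predict_closer(emb_r, emb_a, emb_b)
    return float(np.mean(preds == np.asarray(labels)))


# Toy embeddings for two triples (hypothetical values).
emb_r = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_a = np.array([[1.0, 0.1], [1.0, 0.0]])
emb_b = np.array([[0.0, 1.0], [0.1, 1.0]])
accuracy = triple_accuracy(emb_r, emb_a, emb_b, labels=[1, 0])
```

Because the labels are binary comparisons rather than numerical scores, accuracy over triples gives a direct way to compare any two models on the same annotated set.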

Paper Structure (19 sections, 1 equation, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Each arrow points to the candidate that is considered closer to the reference by the model(s) or human annotation. Visual similarity scores computed by deep models are not always aligned with human annotations. All data, annotations, and source code used for this work can be found at https://github.com/Zukang-Liao/Context-Sensitive-Image-Similarity
  • Figure 2: Given a training set of random triples annotated with which candidate is semantically closer to the reference, can a model learn from the training data and predict correctly for unseen triples (i.e., unseen reference images and unseen candidates)?
  • Figure 3: Workflow overview: each CS-model is trained on a CS data cluster. An analytical ensemble model is obtained based on the performance of each CS-model on the validation set. We also train global models using amalgamated data from the validation set and CS clusters for comparisons.
  • Figure 4: To train each CS model, we concatenate the embeddings and train a small ranking block to conduct binary classification. The cross-entropy loss of the ranking block, triplet loss, and LoRA are used to assist in fine-tuning the backbone.
  • Figure 5: Ensemble Approach (PCA): for all triples $(R_i, x_a, x_b)$ sharing the same reference image $R_i$, we compute an accuracy score from each model. We visualize the accuracy scores of the $|\mathbb{T}_V|$ reference images in our validation set using PCA or t-SNE. Different models perform well in different areas. An ensemble method can be obtained based on the scatter plots.
  • ...and 4 more figures
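
The per-reference accuracy scores described in the Figure 5 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: `per_reference_accuracy` and `pca_2d` are hypothetical helper names, the input is assumed to be a boolean table recording whether each model predicted each validation triple correctly, and PCA is computed with a plain SVD. The sketch covers only the accuracy matrix and its 2D projection; how the ensemble is derived from the resulting scatter plot is not shown.

```python
import numpy as np


def per_reference_accuracy(ref_ids, correct):
    """Accuracy of each model on the triples sharing each reference image.

    ref_ids: (n_triples,) reference-image id of every triple.
    correct: (n_triples, n_models), True where a model predicted correctly.
    Returns the sorted unique reference ids and an (n_refs, n_models) matrix.
    """
    ref_ids = np.asarray(ref_ids)
    correct = np.asarray(correct, dtype=float)
    refs = np.unique(ref_ids)
    acc = np.stack([correct[ref_ids == r].mean(axis=0) for r in refs])
    return refs, acc


def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy example: four triples over two reference images, three models.
ref_ids = [0, 0, 1, 1]
correct = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 1]]
refs, acc = per_reference_accuracy(ref_ids, correct)
coords = pca_2d(acc)  # one 2D point per reference image
```

Each row of `acc` is the accuracy profile of one reference image across models, so nearby points in `coords` correspond to reference images on which the models behave similarly.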