Exploring Visual Embedding Spaces Induced by Vision Transformers for Online Auto Parts Marketplaces

Cameron Armijo, Pablo Rivas

TL;DR

This paper investigates the capability of Vision Transformer (ViT)–Base to generate visual embeddings for auto parts images drawn from online marketplaces, using a single-modality approach (images only) to assess patterns potentially related to illicit activity. The authors extract 768-dimensional ViT embeddings for 85,000 images, reduce dimensionality with UMAP, and cluster with k-means (k=20) across multiple embedding sizes (16–128). They find that 64-dimensional embeddings offer the best trade-off between cluster quality and efficiency, producing coherent visual groups (e.g., exteriors, powertrain components) but suffering from cluster overlap and outliers due to the absence of textual or contextual data. Compared with multimodal approaches, single-modality ViT clustering yields substantially lower silhouette scores (e.g., 0.015 vs. 0.3819), highlighting the value of textual metadata in marketplace analysis. The work provides a foundational baseline for image-only clustering in online auto-part marketplaces and suggests future directions toward domain-specific pretraining and hybrid multimodal models to enhance detection of illicit activities.
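The embedding-size comparison summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it uses random vectors in place of real ViT embeddings, swaps in PCA for UMAP so that it depends only on NumPy and scikit-learn, and assumes the sizes 16, 32, 64, and 128 (the summary gives only the range 16–128). The function name `silhouette_by_dimension` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def silhouette_by_dimension(embeddings, dims=(16, 32, 64, 128), k=20, seed=0):
    """Reduce embeddings to each target size, cluster with K-Means,
    and return a {dimension: silhouette score} mapping."""
    scores = {}
    for d in dims:
        # PCA is a stand-in here; the paper reduces dimensionality with UMAP.
        reduced = PCA(n_components=d, random_state=seed).fit_transform(embeddings)
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
        scores[d] = float(silhouette_score(reduced, labels))
    return scores


# Synthetic stand-in for 768-dimensional ViT embeddings (assumption: no real data).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(600, 768))
scores = silhouette_by_dimension(fake_embeddings)
for d, s in scores.items():
    print(f"dim={d:3d}  silhouette={s:.4f}")
```

Silhouette scores range from -1 to 1, with values near zero indicating heavily overlapping clusters, which is the regime the image-only result (0.015) falls into.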

Abstract

This study examines the capabilities of the Vision Transformer (ViT) model in generating visual embeddings for images of auto parts sourced from online marketplaces, such as Craigslist and OfferUp. By focusing exclusively on single-modality data, the analysis evaluates ViT's potential for detecting patterns indicative of illicit activities. The workflow involves extracting high-dimensional embeddings from images, applying dimensionality reduction techniques like Uniform Manifold Approximation and Projection (UMAP) to visualize the embedding space, and using K-Means clustering to categorize similar items. Representative posts nearest to each cluster centroid provide insights into the composition and characteristics of the clusters. While the results highlight the strengths of ViT in isolating visual patterns, challenges such as overlapping clusters and outliers underscore the limitations of single-modal approaches in this domain. This work contributes to understanding the role of Vision Transformers in analyzing online marketplaces and offers a foundation for future advancements in detecting fraudulent or illegal activities.
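The step of picking representative posts nearest to each cluster centroid might look like the sketch below. It assumes the embeddings are already available as a NumPy array (random vectors stand in for them here), and the function name `representatives_per_cluster` and the `top_n` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans


def representatives_per_cluster(embeddings, n_clusters=20, top_n=5, seed=0):
    """Cluster embeddings with K-Means and return the labels plus, for each
    cluster, the indices of the top_n members closest to its centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    reps = {}
    for c in range(n_clusters):
        # Row indices of all points assigned to cluster c.
        members = np.flatnonzero(km.labels_ == c)
        # Euclidean distance from each member to the cluster centroid.
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        reps[c] = members[np.argsort(dists)[:top_n]]
    return km.labels_, reps


# Synthetic stand-in for reduced ViT embeddings (assumption: no real data).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(500, 64))
labels, reps = representatives_per_cluster(fake_embeddings, n_clusters=20, top_n=3)
print(reps[0])  # indices of the images closest to centroid 0
```

In practice the returned indices would be mapped back to the original listings so that the corresponding images can be inspected by eye, as in the paper's representative-image figures.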

Paper Structure

This paper contains 21 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Example of an interior view from an online auto parts listing, showcasing a car's seating and dashboard. This image represents the type of visual data used for clustering and analysis in this study.
  • Figure 2: Vision Transformer architecture, illustrating the process of dividing an input image into patches, applying a linear projection, and processing through a transformer encoder. The final output is classified using an MLP head. Adapted from Dosovitskiy et al. (2021) and Vaswani et al. (2023).
  • Figure 3: In the proposed methodology, the input data are images that are embedded with a ViT and then analyzed in search of cluster information.
  • Figure 4: UMAP visualization of embeddings reduced to 64 dimensions, illustrating the clustering of images from online auto parts listings. Each color represents a distinct cluster identified using K-Means, revealing patterns and relationships within the dataset.
  • Figure 5: Representative images from posts located near a cluster centroid; the cluster appears to capture wheel-like objects.
  • ...and 3 more figures