Bottleneck-based Encoder-decoder ARchitecture (BEAR) for Learning Unbiased Consumer-to-Consumer Image Representations

Pablo Rivas, Gisela Bichler, Tomas Cerny, Laurie Giddens, Stacie Petter

TL;DR

The paper addresses learning unbiased, privacy-preserving image representations for consumer-to-consumer (C2C) imagery to aid illicit-activity detection. It introduces BEAR, a bottleneck-based encoder–decoder autoencoder that combines ConvLSTM-based perceptual encoding, residual feature entanglement, and a multi-branch decoder to produce compact latent representations. Training on roughly 2 million 128×128 color images and evaluating on C2C, CIFAR-10, and ImageNet demonstrates convergent learning, meaningful latent clustering via k-means, and informative visualizations with UMAP, while preserving privacy by obscuring personal identifiers. The authors argue for a lightweight, less-label-biased alternative to transformer-based or contrastive models and propose future multimodal expansion with text and contrastive learning to build a trafficking-detection pipeline.
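The latent-clustering step mentioned above can be illustrated with a minimal sketch of k-means (Lloyd's algorithm) applied to latent vectors. This is not the authors' implementation: the function name `kmeans`, the evenly spaced initialization, and the toy 2-D `latents` values are assumptions made for illustration, and the real BEAR latent dimensionality is not stated here.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on a list of equal-length float vectors."""
    # Deterministic initialization (an assumption): take k points spaced
    # evenly through the input list as the starting centroids.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-D latent codes standing in for encoder outputs.
latents = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
           (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(latents, k=2)
print(labels)
```

In practice the input would be the bottleneck vectors produced by a trained encoder, and a library implementation with a better initialization scheme would likely be preferable to this hand-rolled loop.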

Abstract

Unbiased representation learning is still an object of study under specific applications and contexts. Novel architectures are usually crafted to resolve particular problems using mixtures of fundamental pieces. This paper presents different image feature extraction mechanisms that work together with residual connections to encode perceptual image information in an autoencoder configuration. We use image data that aims to support a larger research agenda dealing with criminal activity on consumer-to-consumer online platforms. Preliminary results suggest that the proposed architecture can learn rich spaces from our dataset and other image datasets, resolving important challenges that we identify.

Paper Structure (8 sections, 2 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Proposed residual autoencoder that uses convolutional LSTMs for perceptual information extraction and compression.
  • Figure 2: The Perceptual Feature Encoder architecture that reduces dimensions while extracting perceptual information.
  • Figure 3: The Residual Feature Entanglement, which takes as input both the previous layer's information and the residual, preserving dimensions.
  • Figure 4: The BFE uses a convolutional LSTM and a dense layer.
  • Figure 5: The DE reconstructs feature maps using a dense layer.
  • ...and 4 more figures