Local Aggregation for Unsupervised Learning of Visual Embeddings

Chengxu Zhuang, Alex Lin Zhai, Daniel Yamins

TL;DR

This paper tackles unsupervised learning for large-scale visual recognition by introducing Local Aggregation (LA), a non-parametric embedding objective that jointly learns representations and a soft, multi-scale clustering structure. By maintaining per-example close and background neighbor sets and a memory-bank-based training regime, LA promotes tight local groups while preserving dispersion at larger scales. Empirical results show state-of-the-art unsupervised transfer on ImageNet and Places 205, and strong performance on PASCAL VOC object detection, with deeper networks yielding larger gains. The work demonstrates that balancing local clustering against larger-scale separation can yield highly transferable visual embeddings without labels.

Abstract

Unsupervised approaches to learning in neural networks are of substantial interest for furthering artificial intelligence, both because they would enable the training of networks without the need for large numbers of expensive annotations, and because they would be better models of the kind of general-purpose learning deployed by humans. However, unsupervised networks have long lagged behind the performance of their supervised counterparts, especially in the domain of large-scale visual recognition. Recent developments in training deep convolutional embeddings to maximize non-parametric instance separation and clustering objectives have shown promise in closing this gap. Here, we describe a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate. This aggregation metric is dynamic, allowing soft clusters of different scales to emerge. We evaluate our procedure on several large-scale visual recognition datasets, achieving state-of-the-art unsupervised transfer learning performance on object recognition in ImageNet, scene recognition in Places 205, and object detection in PASCAL VOC.

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of the Local Aggregation (LA) method. For each input image, we use a deep neural network to embed it into a lower-dimensional space ("Embedding Space" panel). We then identify its close neighbors (blue dots) and background neighbors (black dots). The optimization seeks to push the current embedding vector (red dot) closer to its close neighbors and farther from its background neighbors. The blue arrow and black arrow are examples of influences from different neighbors on the current embedding during optimization. The "After Optimization" panel illustrates the typical structure of the final embedding after training.
  • Figure 2: Distributions across all ImageNet training images of local and background densities for feature embeddings. We compare features from ResNet-18 (orange bars) and ResNet-50 (green bars) architectures as trained by the LA method, as well as those from a ResNet-18 architecture trained by the Instance Recognition (IR) method (blue bars). The local and background densities at each embedded vector are estimated by averaging dot products between that vector and, respectively, its top 30 or its 1000th-4096th nearest neighbors in $\mathbf{\bar{V}}$. See supplementary material for more detail.
  • Figure 3: For each of several validation images in the left-most column, nearest neighbors in the LA-trained ResNet-50 embedding, with similarity decreasing from left to right. The top three rows are successfully classified cases, with high KNN-classifier confidence, while the lower three are failure cases, with low KNN-classifier confidence.
  • Figure 4: Multi-dimensional scaling (MDS) embedding results for network outputs of classes with high validation accuracy (left panel) and classes with low validation accuracy (right panel). For each class, we randomly choose 100 images of that class from the training set and apply the MDS algorithm to the resulting 600 images. Dots represent individual images in each color-coded category. Gray boxes show examples of images from a single class ("trombone") that have been embedded in two distinct subclusters.
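
The density estimate described in the Figure 2 caption (averaging dot products between an embedding and its top 30, or its 1000th-4096th, nearest neighbors) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `neighbor_densities`, its parameters, and the random `bank` array standing in for $\mathbf{\bar{V}}$ are all assumptions made for the example. It also assumes unit-norm embeddings, so that ranking by dot product matches ranking by nearest neighbor, and it does not exclude the query vector itself from its own neighbor list, a detail the caption leaves to the supplementary material.

```python
import numpy as np


def neighbor_densities(v, bank, n_local=30, bg_start=1000, bg_end=4096):
    """Estimate local and background density for one embedding.

    v    : (D,) query embedding, assumed unit-norm.
    bank : (N, D) array of unit-norm embeddings (hypothetical stand-in
           for the stored embeddings V-bar).
    Neighbor ranks are 1-indexed: the local set is ranks 1..n_local,
    the background set is ranks bg_start..bg_end.
    """
    if bank.shape[0] < bg_end:
        raise ValueError("bank must hold at least bg_end embeddings")
    # Dot products with every stored embedding, sorted from most to least similar.
    sims = np.sort(bank @ v)[::-1]
    local = sims[:n_local].mean()
    background = sims[bg_start - 1:bg_end].mean()
    return local, background


if __name__ == "__main__":
    # Illustrative random data only; real densities would use trained embeddings.
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(5000, 128))
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)
    local, background = neighbor_densities(bank[0], bank)
    print(f"local density: {local:.3f}, background density: {background:.3f}")
```

Because the similarities are sorted in descending order, the local density is never smaller than the background density; a larger gap between the two suggests tighter local grouping relative to the surrounding neighborhood.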