Graph-RISE: Graph-Regularized Image Semantic Embedding

Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, Sujith Ravi

TL;DR

Graph-RISE tackles the challenge of ultra-fine image semantics by reframing embedding learning as large-scale classification augmented with neural graph regularization. It introduces a graph-regularized training objective and a Graph-RISE graph that encodes image-image relationships via co-click and similar-image signals, trained on roughly 260M images and 40M labels. The approach yields substantial improvements over state-of-the-art models on kNN and triplet evaluations for ImageNet, iNaturalist, and related datasets, and qualitative results show closer alignment with human perception. This work demonstrates that combining massive-scale labeled data with graph-regularized neural networks can produce instance-level embeddings with practical benefits for search and ranking.

Abstract

Learning image representations to capture fine-grained semantics has been a challenging and important task enabling many applications such as image search and clustering. In this paper, we present Graph-Regularized Image Semantic Embedding (Graph-RISE), a large-scale neural graph learning framework that allows us to train embeddings to discriminate an unprecedented O(40M) ultra-fine-grained semantic labels. Graph-RISE outperforms state-of-the-art image embedding algorithms on several evaluation tasks, including image classification and triplet ranking. We provide case studies to demonstrate that, qualitatively, image retrieval based on Graph-RISE effectively captures semantics and, compared to the state of the art, differentiates nuances at levels that are closer to human perception.

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Spectrum of image semantic similarity. We provide six image examples (two for each granularity) to illustrate the difference from coarser (left) to ultra-fine granularity (right). We refer to ultra fine-grained as "instance-level" to contrast with category-level and fine-grained semantics.
  • Figure 2: Six potential samples of image-query pairs. Each image is labeled with the corresponding textual search query.
  • Figure 3: An illustration of a graph-regularized neural network. The image similarity subgraph of a training image $x_u$ (with the ground-truth labels $y_u$) is factorized into image-image pairs, where the neighbor image $x_v$ is semantically similar to $x_u$. The training objective consists of both the supervised loss $\mathcal{L}$ and the graph regularization $\Omega$; minimizing $\Omega$ reduces the distance between the embeddings of similar images, $\phi(x_u)$ and $\phi(x_v)$, so the neural network is trained to encode the local structure of the graph.
  • Figure 4: An illustration of the Graph-RISE framework. Flow in red is added to enable graph regularization and is required only during training. In the input layer, a labeled image is associated with one of its neighbor images, which can be either labeled or unlabeled, and then fed into the ResNet together with that neighbor image. The image embeddings generated by the ResNet are then used to compute both (a) the cross-entropy loss and (b) the graph regularization.
  • Figure 5: PIT triplet evaluation on Recall vs. Margin.
  • ...and 2 more figures
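
The objective described in the captions of Figures 3 and 4 (a supervised cross-entropy loss plus a graph regularization term over an image and one of its neighbors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the squared L2 distance, the multiplier `alpha`, the edge weight `weight`, and all function names are assumptions made here for clarity, since this summary does not state the exact distance metric or weighting.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, given raw logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def graph_regularized_loss(logits_u, label_u, emb_u, emb_v,
                           alpha=1.0, weight=1.0):
    """Supervised loss on image u plus a penalty pulling phi(x_u) toward phi(x_v).

    emb_u, emb_v: embeddings of the training image and one neighbor image.
    alpha, weight: hypothetical regularization multiplier and edge weight.
    """
    supervised = softmax_cross_entropy(logits_u, label_u)
    omega = weight * squared_l2(emb_u, emb_v)
    return supervised + alpha * omega
```

With identical embeddings for the two images the regularization term vanishes and only the supervised loss remains; the further apart the neighbor's embedding lies, the larger the penalty. The neighbor contributes only through its embedding, which is consistent with the caption's note that it may be unlabeled.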