Table of Contents
Fetching ...

Zero-Shot Learning by Convex Combination of Semantic Embeddings

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, Jeffrey Dean

TL;DR

The paper addresses zero-shot learning for large-scale image classification by reframing how images map into semantic label spaces. It introduces ConSE, a simple method that converts a pretrained n0-way classifier's output into an embedding via a convex combination of semantic label vectors, using top-T predictions and cosine similarity for label extrapolation. Experiments on ImageNet show ConSE, particularly with T=10, outperforms the state-of-the-art DeViSE across multiple zero-shot benchmarks, with strong generalization to unseen classes and scalable behavior. The approach leverages existing visual classifiers and word embeddings without additional training, highlighting a robust alternative to regression-based embedding strategies."

Abstract

Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional \nway{} classification framing of image understanding, particularly in terms of the promise for zero-shot learning -- the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing \nway{} image classifier and a semantic word embedding model, which contains the $\n$ class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of the advantages associated with more complex image embedding schemes, and indeed outperforms state of the art methods on the ImageNet zero-shot learning task.

Zero-Shot Learning by Convex Combination of Semantic Embeddings

TL;DR

The paper addresses zero-shot learning for large-scale image classification by reframing how images map into semantic label spaces. It introduces ConSE, a simple method that converts a pretrained n0-way classifier's output into an embedding via a convex combination of semantic label vectors, using top-T predictions and cosine similarity for label extrapolation. Experiments on ImageNet show ConSE, particularly with T=10, outperforms the state-of-the-art DeViSE across multiple zero-shot benchmarks, with strong generalization to unseen classes and scalable behavior. The approach leverages existing visual classifiers and word embeddings without additional training, highlighting a robust alternative to regression-based embedding strategies."

Abstract

Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional \nway{} classification framing of image understanding, particularly in terms of the promise for zero-shot learning -- the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing \nway{} image classifier and a semantic word embedding model, which contains the class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of the advantages associated with more complex image embedding schemes, and indeed outperforms state of the art methods on the ImageNet zero-shot learning task.

Paper Structure

This paper contains 8 sections, 3 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Zero-shot test images from ImageNet, and their corresponding top 5 labels predicted by the Softmax Baseline Krizhevsky12, DeViSE devise, and ConSE($T\!=\!10$). The labels predicted by the Softmax baseline are the labels used for training, and the labels predicted by the other two models are not seen during training of the image classifiers. The correct labels are shown in blue. Examples are hand-picked to illustrate the cases that the ConSE(10) performs well, and a few failure cases.