Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation

Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf

TL;DR

The paper addresses the mismatch between descriptor distributions and the Gaussian assumption behind Fisher Vector representations. It introduces Laplacian (LMM) and Hybrid Gaussian-Laplacian (HGLMM) mixture models, derives EM algorithms and Fisher Vector formulations for both, and shows that in HGLMM a per-dimension choice between Gaussian and Laplacian emerges naturally. Applied to image-text tasks, with word2vec features for text, CNN features for images, and CCA to align the two, HGLMM-based Fisher Vectors outperform traditional GMM-based FVs, and fusing GMM and HGLMM yields the best results across several benchmarks. These findings demonstrate that tailoring the probabilistic model to descriptor statistics can yield state-of-the-art cross-modal retrieval and pave the way for broader use of non-Gaussian FV variants.
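To give a feel for the per-dimension Gaussian-or-Laplacian choice, here is a minimal sketch that is not the paper's EM procedure. It assumes a single component and a single dimension, fits both families by maximum likelihood, and keeps whichever has the higher log-likelihood. The function name `choose_family` and the synthetic samples are hypothetical and used only for illustration.

```python
import numpy as np

def choose_family(x):
    """Fit a Gaussian and a Laplacian to a 1-D sample by maximum likelihood
    and report which one attains the higher log-likelihood."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Gaussian ML fit: sample mean and (biased) standard deviation.
    sigma = x.std()
    gaussian_ll = -0.5 * n * (np.log(2.0 * np.pi * sigma**2) + 1.0)
    # Laplacian ML fit: median and mean absolute deviation from the median.
    s = np.mean(np.abs(x - np.median(x)))
    laplacian_ll = -n * (np.log(2.0 * s) + 1.0)
    family = "gaussian" if gaussian_ll >= laplacian_ll else "laplacian"
    return family, gaussian_ll, laplacian_ll

# Synthetic data (an assumption of this sketch, not data from the paper).
rng = np.random.default_rng(0)
heavy_tailed = rng.laplace(loc=0.0, scale=1.0, size=5000)
bell_shaped = rng.normal(loc=0.0, scale=1.0, size=5000)

print(choose_family(heavy_tailed)[0])
print(choose_family(bell_shaped)[0])
```

In the full model the same kind of decision would be made for every dimension of every mixture component, with samples weighted by their posterior responsibilities. That weighting is omitted here for brevity.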

Abstract

In the traditional object recognition pipeline, descriptors are densely sampled over an image, pooled into a high-dimensional non-linear representation, and then passed to a classifier. In recent years, Fisher Vectors have proven empirically to be the leading representation for a large variety of applications. The Fisher Vector is typically taken as the gradients of the log-likelihood of descriptors with respect to the parameters of a Gaussian Mixture Model (GMM). Motivated by the assumption that different distributions should be applied for different datasets, we present two other Mixture Models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM), which is based on a weighted geometric mean of the Gaussian and Laplacian distributions. An interesting property of the Expectation-Maximization algorithm for the latter is that in the maximization step, each dimension in each component is chosen to be either a Gaussian or a Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we achieve state-of-the-art results for both the image annotation and the image search by a sentence tasks.
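As a concrete reference point for the GMM baseline described above, the following sketch computes the part of a Fisher Vector that corresponds to gradients with respect to the component means of a diagonal-covariance GMM. It uses the commonly cited normalized form, in which each component's gradient is divided by N times the square root of its mixture weight. The paper's own expressions may differ in normalization, and the function name, parameter values, and random descriptors are assumptions made for illustration.

```python
import numpy as np

def gmm_fisher_vector_means(X, weights, means, sigmas):
    """Mean-gradient part of a Fisher Vector for a diagonal GMM.

    X: (N, D) descriptors; weights: (K,); means, sigmas: (K, D),
    where sigmas holds per-dimension standard deviations.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Whitened differences (x_n - mu_k) / sigma_k, shape (N, K, D).
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    # log( w_k * N(x_n; mu_k, diag(sigma_k^2)) ), shape (N, K).
    log_joint = (
        np.log(weights)[None, :]
        - 0.5 * np.sum(diff**2, axis=2)
        - np.sum(np.log(sigmas), axis=1)[None, :]
        - 0.5 * d * np.log(2.0 * np.pi)
    )
    # Posterior responsibilities via a numerically stable softmax over components.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Responsibility-weighted sum of whitened differences, normalized per component.
    grad = (gamma[:, :, None] * diff).sum(axis=0) / (n * np.sqrt(weights)[:, None])
    return grad.ravel()  # length K * D

# Hypothetical toy setup: 200 random 4-D descriptors, 2 components.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 4))
weights = np.array([0.5, 0.5])
means = np.array([[-1.0] * 4, [1.0] * 4])
sigmas = np.ones((2, 4))

fv = gmm_fisher_vector_means(descriptors, weights, means, sigmas)
print(fv.shape)
```

An LMM or HGLMM variant would replace the Gaussian log-density and the corresponding gradient terms with their Laplacian or hybrid counterparts. Those expressions are derived in the paper and are not reproduced here.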

Paper Structure

This paper contains 19 sections, 24 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Shown are two examples in which GMM Fisher Vectors and HGLMM Fisher Vectors considerably differ. For the left example, GMM's rank one result was correct. The rank of the first ground truth result in the list of HGLMM was 35. In the example on the right, the corresponding ranks were 3 and 1.
  • Figure 2: A running example of our RNN on a sample from Flickr8K. The input to the network in the first step is the query image after applying the CNN transformation and then the appropriate CCA projection for images. The input in every following step $t$ is the word2vec representation of the word that was predicted in step $t-1$, after applying the HGLMM Fisher Vector representation and then the appropriate CCA projection for sentences.
  • Figure 3: A few samples from the test set of Flickr8K and the corresponding sentences that were generated by our RNN sentence synthesis model.