Learning visual groups from co-occurrences in space and time

Phillip Isola, Daniel Zoran, Dilip Krishnan, Edward H. Adelson

TL;DR

This work learns visual groupings without labels by exploiting co-occurrence statistics in space and time. It trains a classifier to predict whether two visual primitives (patches, frames, or photos) occur in the same spatial or temporal context, and uses the predicted probability as an affinity for clustering primitives into objects, movie scenes, and place categories. The resulting object proposals are competitive with state-of-the-art supervised methods, the movie scene segmentations correlate well with DVD chapter structure, and the groups learned from geospatial photo collections relate well to semantic place categories. Overall, co-occurrence serves as a self-supervised signal for uncovering semantic structure across several visual domains.

Abstract

We propose a self-supervised framework that learns to group visual entities based on their rate of co-occurrence in space and time. To model statistical dependencies between the entities, we set up a simple binary classification problem in which the goal is to predict if two visual primitives occur in the same spatial or temporal context. We apply this framework to three domains: learning patch affinities from spatial adjacency in images, learning frame affinities from temporal adjacency in videos, and learning photo affinities from geospatial proximity in image collections. We demonstrate that in each case the learned affinities uncover meaningful semantic groupings. From patch affinities we generate object proposals that are competitive with state-of-the-art supervised methods. From frame affinities we generate movie scene segmentations that correlate well with DVD chapter structure. Finally, from geospatial affinities we learn groups that relate well to semantic place categories.
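
To make the grouping step concrete, the following is a minimal sketch of how a symmetric affinity matrix can be partitioned with spectral clustering. It is not the authors' implementation: the matrix `W` is a hand-built toy standing in for learned affinities, and the function name `spectral_bipartition` and the two-way split on the sign of the Fiedler vector are assumptions chosen for illustration.

```python
import numpy as np


def spectral_bipartition(W):
    """Split nodes into two groups using the second eigenvector of the
    symmetric normalized graph Laplacian (hypothetical helper)."""
    W = np.asarray(W, dtype=float)
    W = 0.5 * (W + W.T)                      # enforce a symmetric affinity
    d = W.sum(axis=1)                        # node degrees (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1] * d_inv_sqrt     # second-smallest eigenvector
    return (fiedler > 0).astype(int)


# Toy affinities (assumed values): primitives 0-2 co-occur often with each
# other, as do primitives 3-5, while cross-group co-occurrence is rare.
W = np.full((6, 6), 0.05)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 0.0)

labels = spectral_bipartition(W)
print(labels)  # two groups; which group gets label 0 depends on eigenvector sign
```

In this toy case the two blocks of strongly co-occurring primitives end up in different groups. In the framework described above, the entries of `W` would instead come from the classifier's predicted co-occurrence probability.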

Paper Structure

This paper contains 13 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: We model statistical dependences in the visual world by learning to predict which visual primitives -- patches, frames, or photos -- will be likely to co-occur within the same spatial or temporal context. Above, the primitives are labeled $A$ and $B$, and the context is labeled $\mathcal{C}$. By clustering primitives that predictably co-occur, we can uncover groupings such as objects (a group of patches; left), movie scenes (a group of frames; middle), and place categories (a group of photos; right).
  • Figure 2: Overview of our approach to learning to group patches. We train a classifier that takes two isolated patches, $A$ and $B$, and predicts $\mathcal{C}$: whether or not they were taken from nearby locations in an image. We use the output of the classifier, $P(\mathcal{C}=1|A,B)$, as an affinity measure for grouping. The rightmost panel shows our grouping strategy. We set up a graph in which nodes are image patches, and all nearby nodes are connected with an edge, weighted by the learned affinity (for clarity, only a subset of nodes and edges is shown). We then apply spectral clustering to partition this graph and thereby segment the image. The result on this image is shown in Figure \ref{fig:object_proposals_evaluation}.
  • Figure 3: t-SNE visualizations of the learned affinities in each domain. We construct an affinity matrix between 3000 randomly sampled primitives to create each visualization, using $w(A,B)$ as the affinity measure. We then apply t-SNE on this matrix (tsne). To avoid clutter, we visualize the embedded primitives snapped to the nearest point on a grid. The learned affinities pick up on different kinds of similarity in each domain. Patches are arranged largely according to color, while the geo-photo affinities are less dependent on color, as can be seen in the inset where day and night waterfronts map to nearby points in the t-SNE embedding.
  • Figure 4: Example object proposals. Out of 100 proposals per image, we show those that best overlap the ground truth object masks. Average best overlap (defined in krahenbuhl2014geodesic) and recall at a Jaccard index of 0.5 are superimposed over each result.
  • Figure 5: Object proposal results, evaluated on bounding boxes. Our unsupervised method (labeled "Co-occurrence") is competitive with recent supervised algorithms at proposing up to around 100 objects. ABO is the average best overlap metric from (krahenbuhl2014geodesic), and $\mathcal{J}$ is the Jaccard index. The papers compared to are: BING (cheng2014bing), EdgeBoxes (zitnick2014edge), LPO (krahnenbuhl2015), Objectness (alexe2012measuring), GOP (krahenbuhl2014geodesic), Randomized Prim (manen2013prime), Sel. Search (uijlings2013selective).
  • ...and 2 more figures
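
The captions for Figures 4 and 5 refer to the Jaccard index and average best overlap (ABO). The sketch below shows one common way to compute both for bounding boxes, assuming boxes are given as `(x0, y0, x1, y1)` corner tuples and that ABO is the mean, over ground-truth objects, of the best Jaccard index achieved by any proposal. The function names and the example boxes are hypothetical, and the evaluation code used in the paper may differ in detail.

```python
def jaccard(box_a, box_b):
    """Jaccard index (intersection over union) of two axis-aligned boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_overlaps(gt_boxes, proposals):
    """Best Jaccard index achieved by any proposal, per ground-truth box."""
    return [max(jaccard(g, p) for p in proposals) for g in gt_boxes]


def average_best_overlap(gt_boxes, proposals):
    """Mean of the per-object best overlaps."""
    best = best_overlaps(gt_boxes, proposals)
    return sum(best) / len(best)


def recall_at(gt_boxes, proposals, threshold=0.5):
    """Fraction of ground-truth boxes matched at or above the threshold."""
    best = best_overlaps(gt_boxes, proposals)
    return sum(b >= threshold for b in best) / len(best)


# Made-up boxes for illustration only.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 5), (20, 20, 30, 30), (100, 100, 110, 110)]

abo = average_best_overlap(gt, props)        # (0.5 + 1.0) / 2 = 0.75
recall = recall_at(gt, props, threshold=0.5)  # both objects matched: 1.0
print(abo, recall)
```

Figure 4 evaluates overlap against object masks rather than boxes; the same definitions apply with pixel-set intersection and union in place of box areas.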